We have a handful of large servers that are shared by a number of users for NGS data analysis. There is currently no scheduling system set up on these machines; users are generally expected to 'play nice' with others on a particular system.
We are looking for ways to better monitor the usage of these machines. Specifically, we would like to know both short-term usage:
UserA is consuming 50% of the CPU power on BoxB right now.
And also (perhaps more importantly) long-term usage trends:
In the past 6 months, on average we are using 20% of BoxB's CPU power and 40% of its RAM. UserC is the top CPU user at 50% total usage.
How are other groups answering these monitoring questions?
Sun Grid Engine is worth looking at. It has great capabilities for scheduling and monitoring exactly these kinds of resources. Our cluster computing facility uses it to schedule jobs and monitor them for fair usage.
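If you go the SGE route, its accounting database answers the long-term question almost directly via `qacct`. Here is a minimal sketch of pulling a six-month per-user CPU summary, assuming `qacct` is on your PATH and that `qacct -o -d <days>` prints one row per owner with CPU seconds in the fifth column; the column layout varies between SGE versions, so check `qacct -help` and adjust the parsing:

```python
#!/usr/bin/env python
"""Summarise per-user CPU consumption from SGE accounting data.

Assumes `qacct -o -d <days>` prints one row per owner with a CPU-seconds
column (OWNER WALLCLOCK UTIME STIME CPU ...); verify against your version.
"""
import subprocess

DAYS = 180  # look back roughly six months

# Per-owner usage summary for the last DAYS days.
out = subprocess.check_output(["qacct", "-o", "-d", str(DAYS)], text=True)

usage = {}
for line in out.splitlines():
    fields = line.split()
    # Skip the header row and "====" separator lines.
    if len(fields) < 5 or fields[0] == "OWNER" or set(fields[0]) == {"="}:
        continue
    try:
        usage[fields[0]] = float(fields[4])  # CPU seconds, 5th column
    except ValueError:
        continue

total = sum(usage.values()) or 1.0
for user, cpu in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(f"{user:<12} {cpu:>12.0f} cpu-s  ({100 * cpu / total:.1f}% of total)")
```

This is the "UserC is the top CPU user at 50% total usage" report from the question, computed from data SGE records anyway as a side effect of scheduling.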
I hate to state the obvious, but if Nagios and Ganglia are already in place, then they should be configured to monitor the things of interest. For long-term trends, couldn't the information from Nagios and Ganglia be scraped by a custom or third-party script? A minimal sketch of such a collector is below.

In particular, you might want to think about formulating some metrics that make sense for your environment. With a variety of users doing different things on these machines, some ad hoc scripts, some more mature pipelines, when they log onto one of the machines to see if it's available for their big-ass script, what are they looking for? How often will they decide not to run their script and instead try a different machine? (My low-tech solution to a similar problem was to keep a jar by my terminal, and every time I decided a particular machine was not available, I put a quarter in the jar. After a while the evidence stacked up and made an obvious case for more resources.) Programs like Nagios can collect the raw data, but the actual metric of interest might take some thought and formulation from that data.
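Even without Nagios or Ganglia, a cron-driven sampler covers both views: the "right now" snapshot, and (once the log accumulates) the six-month trend. A minimal sketch, assuming a Linux procps `ps` that supports `-eo user,pcpu,pmem --no-headers`; the log path is just a placeholder:

```python
#!/usr/bin/env python
"""Sample per-user CPU/RAM usage and append it to a CSV log.

Run from cron (e.g. every 5 minutes); aggregate the log later for
long-term trends. Assumes Linux procps `ps`; the log path is illustrative.
"""
import csv
import subprocess
import time
from collections import defaultdict

LOG = "/var/log/usage_samples.csv"  # hypothetical location; pick your own

def sample():
    """Return {user: [total %CPU, total %MEM]} summed over each user's processes."""
    out = subprocess.check_output(
        ["ps", "-eo", "user,pcpu,pmem", "--no-headers"], text=True
    )
    totals = defaultdict(lambda: [0.0, 0.0])
    for line in out.splitlines():
        user, pcpu, pmem = line.split()
        totals[user][0] += float(pcpu)
        totals[user][1] += float(pmem)
    return totals

if __name__ == "__main__":
    now = int(time.time())
    with open(LOG, "a", newline="") as fh:
        writer = csv.writer(fh)
        for user, (pcpu, pmem) in sample().items():
            writer.writerow([now, user, round(pcpu, 1), round(pmem, 1)])
```

One caveat: `ps` reports %CPU per core, so the per-user sum can exceed 100% on a multi-core box; divide by the core count if you want "UserA is consuming 50% of the CPU power on BoxB" in the question's sense.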