How To Monitor Shared Computing Resources?
3
11
Entering edit mode
12.2 years ago

We have a handful of large servers that are shared by a number of users for NGS data analysis. Currently there is no scheduling system set up on these machines; users are generally expected to 'play nice' with others on a particular system.

We are looking for ways to better monitor the usage of these machines. Specifically, we would like to know both short-term usage:

  • UserA is consuming 50% of the CPU power on BoxB right now (a snapshot of the sort sketched just below this list)

And also (perhaps more importantly) long-term usage trends:

  • In the past 6 months, on average we are using 20% of BoxB's CPU power and 40% of its RAM. UserC is the top CPU user at 50% total usage.
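
The 'right now' half can at least be hacked together from ps (a rough snapshot only, since ps reports each process's lifetime CPU average rather than an instantaneous reading), but it says nothing about trends:

    # Sum %CPU per user across all processes on this box
    ps -eo user:20,pcpu --no-headers \
      | awk '{cpu[$1] += $2} END {for (u in cpu) printf "%-20s %6.1f%%\n", u, cpu[u]}' \
      | sort -k2 -rn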

How are other groups answering these monitoring questions?

I saw a few related BioStar questions, but so far, nothing on how these resources are monitored.

We have Ganglia and Nagios set up for these servers, but nobody has yet taken the time to configure them much past their defaults.

Are other groups using Graphite? Is it worth investigating over Ganglia? (The Etsy programmers sure make it sound good).

These tools also don't seem to be very good for long-term trends (please correct me if I'm wrong). What are people using to monitor resources to know when processing power is being saturated?

Thanks!

next-gen sequencing server • 3.4k views
ADD COMMENT
1
Entering edit mode

Do you use a batch job submission system like PBS (Maui)?

ADD REPLY
0
Entering edit mode

We do not use any batch job submission. Each user logs in and runs code in an unscheduled and unrestricted manner.

Also, just to clarify, I'm not talking about a cluster, but individual servers (the BAS concept mentioned in one of the related posts and discussed here: http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html).

ADD REPLY
0
Entering edit mode

There is a clear need for apps that sit in the spectrum between "top" and "valgrind". Unix is much better at dealing with CPU contention than with memory contention, and yet memory stats are often totally inaccurate.
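
As a stopgap, a per-user resident-memory total can be pulled out of ps, though RSS counts shared pages once per process, so treat it as an upper bound:

    # Sum resident memory (RSS, reported in KiB) per user
    ps -eo user:20,rss --no-headers \
      | awk '{mem[$1] += $2} END {for (u in mem) printf "%-20s %8.0f MB\n", u, mem[u]/1024}' \
      | sort -k2 -rn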

ADD REPLY
0
Entering edit mode

This question is somewhat borderline for BioStar; more sysadmin than bioinformatics. You might want to ask at http://serverfault.com/.

ADD REPLY
3
Entering edit mode
12.2 years ago
Gjain 5.8k

Hi Jim,

Sun Grid Engine is worth looking at. It has great capability for scheduling and monitoring these kinds of resources. Our cluster computing facility uses this to schedule jobs and monitor jobs for fair usage.
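
For example, the day-to-day commands look roughly like the following (the parallel environment name and the script are site-specific placeholders):

    # Submit from the current directory, asking for 4 slots in a
    # parallel environment called "smp" (name depends on local setup)
    qsub -cwd -pe smp 4 run_alignment.sh

    # What is queued or running, per user, and how loaded is each host?
    qstat -u '*'
    qhost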

ADD COMMENT
2
Entering edit mode

This question isn't about scheduling in the SGE sense, it's about how to monitor the need for more shared servers that are not part of a cluster. The users on these machines are often writing one-off scripts and executing ad hoc pipelines here and there.

ADD REPLY
1
Entering edit mode

You should consider SGE or something similar to manage resources on your servers. You could configure an SGE scheduler on each server which submits jobs only to that server, i.e. each server acts as both a queue master and execution host (a rough outline follows below).

As a sysadmin, I've seen too many problems with boxes being overloaded because users think there are sufficient resources available.
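
A very rough outline of that single-host setup, assuming the Debian/Ubuntu gridengine packages (the qconf steps open an editor for the details, and defaults vary by distribution):

    # Install master and execution daemon side by side on each server
    apt-get install gridengine-master gridengine-exec gridengine-client

    # Register the machine itself as an execution and submit host
    qconf -ae $(hostname)    # add execution host
    qconf -as $(hostname)    # add submit host

    # Create a queue bound to this host, with slots = number of cores
    qconf -aq all.q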

ADD REPLY
0
Entering edit mode

Thanks for the suggestion. Would you recommend this kind of infrastructure even for individual large servers - not in a cluster environment?

ADD REPLY
0
Entering edit mode

Yes. In addition to scheduling, SGE allows monitoring system resource usage over time (qacct) and also job control.
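
For instance, assuming accounting is enabled with the default settings, something like this answers the six-month question:

    # Per-user totals (CPU time, memory, I/O) over the last 180 days
    qacct -o -d 180

    # The same, restricted to a single user
    qacct -o UserC -d 180

    # Resource usage of one finished job (example job id)
    qacct -j 123456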

ADD REPLY
0
Entering edit mode

Also very important, with SGE or other queueing systems, is the queue policy: whether the scheduler continuously re-evaluates all queued jobs (e.g. fair-share between users) or simply runs them first come, first served.
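
For what it's worth, a basic functional fair-share setup is roughly the following; the parameter and attribute names come from the SGE scheduler and user configurations, so check the qconf man page for your version:

    # Give functional tickets weight in the scheduling policy
    qconf -msconf            # set e.g. weight_tickets_functional 10000

    # Give each user an equal functional share (example user name)
    qconf -muser alice       # set fshare 100, repeat per user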

ADD REPLY
3
Entering edit mode
12.2 years ago
seidel 11k

I hate to state the obvious, but if Nagios and Ganglia are already in place, then they should be configured to monitor the things of interest. For long-term trends, couldn't the information from Nagios and Ganglia be scraped by a custom or third-party script?

In particular, you might want to think about formulating some metrics that make sense for your environment. With a variety of users doing different things on these machines, some running ad hoc scripts and some running more mature pipelines, when they log onto one of the machines to see if it's available for their big ass script, what are they looking for? How often will they decide not to run their script and instead try a different machine? (My low-tech solution to a similar problem was to keep a jar by my terminal, and every time I decided a particular machine was not available, I put a quarter in the jar. After x amount of time the evidence stacked up and made an obvious case for more resources.) I think the data can be collected by programs like Nagios, but the actual metric of interest might take some thought and formulation from that data.
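
As one concrete route: gmond will dump its current metrics as XML to anything that connects on its TCP port (8649 by default), so a cron job can snapshot a few metrics into a flat file for later trend analysis. A rough sketch, with the host name, metric list and log path as placeholders:

    # Append a timestamped load/memory/CPU snapshot for BoxB to a log,
    # e.g. from cron every 10 minutes; parse or plot the log later.
    nc boxb.example.org 8649 \
      | grep -oE 'METRIC NAME="(load_one|mem_free|cpu_user)" VAL="[^"]*"' \
      | sed "s/^/$(date +%s) /" >> /var/log/boxb-metrics.log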

ADD COMMENT
0
Entering edit mode

Certainly that would be a start. I guess I am hoping to hear the specifics on what other groups are doing, as this issue must come up for others.

I am also wondering if there are any tools that already provide the ability to formulate these kinds of metrics easily. Hacking together a script to tease data out of Ganglia is fine, if there is actually nothing better out there for this kind of problem.

ADD REPLY
0
Entering edit mode
10.9 years ago
Yannick Wurm ★ 2.5k

Jim: psacct keeps a tally of which users used how much CPU/RAM (without any need for a queueing system).
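
Once the psacct/acct package is installed and accounting is switched on, the summaries come from sa and friends; rough examples (the user name is just a placeholder):

    # Per-user summary of process counts and CPU time
    sa -m

    # Connect-time totals per user, broken down by day
    ac -p -d

    # Which commands did a given user run recently?
    lastcomm --user alice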

ADD COMMENT
