How To Monitor Shared Computing Resources?
3
11
Entering edit mode
12.2 years ago

We have a handful of large servers that are shared by a number of users for NGS data analysis. Currently there is no scheduling system set up on these machines; users are generally expected to 'play nice' with others on a particular system.

We are looking for ways to better monitor the usage of these machines. Specifically, we would like to know both short-term usage:

  • UserA is consuming 50% of the CPU power on BoxB right now (a snapshot of the sort sketched just below this list)

And also (perhaps more importantly) long-term usage trends:

  • In the past 6 months, on average we are using 20% of BoxB's CPU power and 40% of its RAM. UserC is the top CPU user at 50% total usage.
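
The 'right now' half can at least be hacked together from ps (a rough snapshot only, since ps reports each process's lifetime CPU average rather than an instantaneous reading), but it says nothing about trends:

    # Sum %CPU per user across all processes on this box
    ps -eo user:20,pcpu --no-headers \
      | awk '{cpu[$1] += $2} END {for (u in cpu) printf "%-20s %6.1f%%\n", u, cpu[u]}' \
      | sort -k2 -rn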

How are other groups answering these monitoring questions?

I saw a few related BioStar questions, but so far, nothing on how these resources are monitored.

We have Ganglia and Nagios set up for these servers, but nobody has yet taken the time to configure them much past their defaults.

Are other groups using Graphite? Is it worth investigating over Ganglia? (The Etsy programmers sure make it sound good).

These tools also don't seem to be very good for long-term trends (please correct me if I'm wrong). What are people using to monitor resources to know when processing power is being saturated?

Thanks!

next-gen sequencing server • 3.4k views
ADD COMMENT
1
Entering edit mode

Do you use a batch job submission system like PBS (Maui)?

ADD REPLY
0
Entering edit mode

We do not use any batch job submission. Each user logs in and runs code in an unscheduled and unrestricted manner.

Also, just to clarify, I'm not talking about a cluster, but individual servers (the BAS concept mentioned in one of the related posts and discussed here: http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html).

ADD REPLY
0
Entering edit mode

There is a clear need for apps that sit in the spectrum between "top" and "valgrind". Unix is much better at dealing with CPU contention than with memory contention, and yet memory stats are often totally inaccurate.
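
As a stopgap, a per-user resident-memory total can be pulled out of ps, though RSS counts shared pages once per process, so treat it as an upper bound:

    # Sum resident memory (RSS, reported in KiB) per user
    ps -eo user:20,rss --no-headers \
      | awk '{mem[$1] += $2} END {for (u in mem) printf "%-20s %8.0f MB\n", u, mem[u]/1024}' \
      | sort -k2 -rn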

ADD REPLY
0
Entering edit mode

This question is somewhat borderline for BioStar; more sysadmin than bioinformatics. You might want to ask at http://serverfault.com/.

ADD REPLY
3
Entering edit mode
12.2 years ago
Gjain 5.8k

Hi Jim,

Sun Grid Engine is worth looking at. It has great capability for scheduling and monitoring these kinds of resources. Our cluster computing facility uses this to schedule jobs and monitor jobs for fair usage.
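
For example, the day-to-day commands look roughly like the following (the parallel environment name and the script are site-specific placeholders):

    # Submit from the current directory, asking for 4 slots in a
    # parallel environment called "smp" (name depends on local setup)
    qsub -cwd -pe smp 4 run_alignment.sh

    # What is queued or running, per user, and how loaded is each host?
    qstat -u '*'
    qhost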

ADD COMMENT
2
Entering edit mode

This question isn't about scheduling in the SGE sense, it's about how to monitor the need for more shared servers that are not part of a cluster. The users on these machines are often writing one-off scripts and executing ad hoc pipelines here and there.

ADD REPLY
1
Entering edit mode

You should consider SGE or something similar to manage resources on your servers. You could configure an SGE scheduler on each server which submits jobs only to that server, i.e. each server acts as both a queue master and execution host (a rough outline follows below).

As a sysadmin, I've seen too many problems with boxes being overloaded because users think there are sufficient resources available.
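
A very rough outline of that single-host setup, assuming the Debian/Ubuntu gridengine packages (the qconf steps open an editor for the details, and defaults vary by distribution):

    # Install master and execution daemon side by side on each server
    apt-get install gridengine-master gridengine-exec gridengine-client

    # Register the machine itself as an execution and submit host
    qconf -ae $(hostname)    # add execution host
    qconf -as $(hostname)    # add submit host

    # Create a queue bound to this host, with slots = number of cores
    qconf -aq all.q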

ADD REPLY
0
Entering edit mode

Thanks for the suggestion. Would you recommend this kind of infrastructure even for individual large servers - not in a cluster environment?

ADD REPLY
0
Entering edit mode

Yes. In addition to scheduling, SGE allows monitoring system resource usage over time (qacct) and also job control.
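
For instance, assuming accounting is enabled with the default settings, something like this answers the six-month question:

    # Per-user totals (CPU time, memory, I/O) over the last 180 days
    qacct -o -d 180

    # The same, restricted to a single user
    qacct -o UserC -d 180

    # Resource usage of one finished job (example job id)
    qacct -j 123456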

ADD REPLY
0
Entering edit mode

Also very important, with SGE or other queueing systems, is the queue policy: whether the scheduler continuously re-evaluates all queued jobs (e.g. fair-share between users) or simply runs them first come, first served.
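
For what it's worth, a basic functional fair-share setup is roughly the following; the parameter and attribute names come from the SGE scheduler and user configurations, so check the qconf man page for your version:

    # Give functional tickets weight in the scheduling policy
    qconf -msconf            # set e.g. weight_tickets_functional 10000

    # Give each user an equal functional share (example user name)
    qconf -muser alice       # set fshare 100, repeat per user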

ADD REPLY
3
Entering edit mode
12.2 years ago
seidel 11k

I hate to state the obvious, but if Nagios and Ganglia are already in place, then they should be configured to monitor the things of interest. For long-term trends, couldn't the information from Nagios and Ganglia be scraped by a custom or third-party script?

In particular, you might want to think about formulating some metrics that make sense for your environment. With a variety of users doing different things on these machines, some running ad hoc scripts and some running more mature pipelines, when they log onto one of the machines to see if it's available for their big ass script, what are they looking for? How often will they decide not to run their script and instead try a different machine? (My low-tech solution to a similar problem was to keep a jar by my terminal, and every time I decided a particular machine was not available, I put a quarter in the jar. After x amount of time the evidence stacked up and made an obvious case for more resources.) I think the data can be collected by programs like Nagios, but the actual metric of interest might take some thought and formulation from that data.
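
As one concrete route: gmond will dump its current metrics as XML to anything that connects on its TCP port (8649 by default), so a cron job can snapshot a few metrics into a flat file for later trend analysis. A rough sketch, with the host name, metric list and log path as placeholders:

    # Append a timestamped load/memory/CPU snapshot for BoxB to a log,
    # e.g. from cron every 10 minutes; parse or plot the log later.
    nc boxb.example.org 8649 \
      | grep -oE 'METRIC NAME="(load_one|mem_free|cpu_user)" VAL="[^"]*"' \
      | sed "s/^/$(date +%s) /" >> /var/log/boxb-metrics.log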

ADD COMMENT
0
Entering edit mode

Certainly that would be a start. I guess I am hoping to hear the specifics on what other groups are doing, as this issue must come up for others.

I am also wondering if there are any tools that already provide the ability to formulate these kinds of metrics easily. Hacking together a script to tease data out of Ganglia is fine, if there is actually nothing better out there for this kind of problem.

ADD REPLY
0
Entering edit mode
10.9 years ago
Yannick Wurm ★ 2.5k

Jim: psacct keeps a tally of which users used how much CPU/RAM (without any need for a queueing system).
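
Once the psacct/acct package is installed and accounting is switched on, the summaries come from sa and friends; rough examples (the user name is just a placeholder):

    # Per-user summary of process counts and CPU time
    sa -m

    # Connect-time totals per user, broken down by day
    ac -p -d

    # Which commands did a given user run recently?
    lastcomm --user alice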

ADD COMMENT
