Question: How can I learn cluster computing?
5.8 years ago by
United States
sviatoslav.kendall770 wrote:

I have access to my institution's supercomputer, and I recognize that knowing how to use a cluster-computing environment is a valuable skill for a bioinformatician, but I do not know how to go about learning to use one.


I imagine there must be some good tutorials out there that I could use to learn the basics. Can someone point me in the right direction?

rna-seq next-gen genome • 2.3k views
ADD COMMENTlink modified 5.8 years ago by Dan D7.1k • written 5.8 years ago by sviatoslav.kendall770

Are you sure that your institution doesn't have a tutorial session or workshop series? They almost always do, because it keeps questions like these from flooding their inbox :)

What cluster software is it running? I may have some notes handy. You might be able to infer the cluster management software by typing one of the following on the command line:

man bsub

man msub

man qsub

Let me know if one of those commands gives you a manpage.
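If you'd rather check for the commands themselves than for their manpages, something like the following should work. It is only a rough sketch: the function name detect_schedulers is made up, and sbatch is included because it comes up further down the thread. It tells you which submit commands are on your PATH, not which scheduler is actually configured.

```shell
#!/usr/bin/env bash
# Report which common job-submission commands are on the PATH.
# detect_schedulers is a hypothetical helper name, not part of any scheduler.
detect_schedulers() {
    local found=""
    local cmd
    for cmd in bsub msub qsub sbatch; do
        if command -v "$cmd" >/dev/null 2>&1; then
            found="$found $cmd"
        fi
    done
    if [ -z "$found" ]; then
        echo "none"
    else
        # Strip the leading space added by the loop.
        echo "${found# }"
    fi
}

detect_schedulers
```

On a login node this prints the commands it found separated by spaces, or "none" if it found nothing.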

ADD REPLYlink modified 5.8 years ago • written 5.8 years ago by Dan D7.1k

I'll add man sbatch to that list.

ADD REPLYlink written 5.8 years ago by Devon Ryan96k

That would be my first suggestion too; get in touch with the IT people, ask about courses or online material. They usually provide at least some basic guides, in the hope that fewer people will break their system :)

ADD REPLYlink written 5.8 years ago by Neilfws48k

Asking internal IT first is both necessary and easiest, not least to learn how LSF (or whatever platform is in use) has been set up, since some options might be mandatory. For example, submitting a job might be as easy as bsub < but you will probably need to say how much memory and run time you want.

The man pages are certainly authoritative and worth referring to but they might give the impression that submitting jobs is more complicated than it actually is in practice!
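To give an idea of what "how much memory, run time" looks like in practice, here is a minimal sketch that prints an LSF job script without submitting anything. The helper name make_lsf_job and all the values are made up; -J, -n, -M, -W, -o and -e are standard bsub options, but the units for -M and which options are mandatory depend on how your site configured LSF, so treat the numbers as placeholders.

```shell
#!/usr/bin/env bash
# Print an LSF job script to stdout; nothing is submitted.
# make_lsf_job is a hypothetical helper; the values passed in are placeholders.
# Usage: make_lsf_job NAME CORES MEM_LIMIT RUNTIME COMMAND
make_lsf_job() {
    local name="$1" cores="$2" mem="$3" runtime="$4" cmd="$5"
    cat <<EOF
#!/bin/bash
#BSUB -J ${name}
#BSUB -n ${cores}
#BSUB -M ${mem}
#BSUB -W ${runtime}
#BSUB -o ${name}.%J.out
#BSUB -e ${name}.%J.err
${cmd}
EOF
}

# Example: 4 cores, memory limit 8000 (site-dependent units), 2 hours.
make_lsf_job demo 4 8000 2:00 'echo "some long command"'
```

You could redirect the output to a file (say demo.lsf, a made-up name) and then feed that file to bsub on standard input.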

ADD REPLYlink written 5.8 years ago by dariober11k

man pages for LSF are horrible!

ADD REPLYlink written 5.8 years ago by brentp23k

Glad to hear I'm not the only one! Between that and man curl it's a tough competition.

ADD REPLYlink written 5.8 years ago by dariober11k

They do offer such courses but somewhat infrequently and I just missed the last one.

ADD REPLYlink written 5.8 years ago by sviatoslav.kendall770

Bring the IT folks coffee and/or beer and I bet they'll give you the quick version of the course (or at the very least give you the slides).

ADD REPLYlink written 5.8 years ago by Devon Ryan96k

man bsub brings up a page about "LSF jobs"

man qsub brings up a page about  "PBS job"

I guess they've got both types of cluster software. Found a couple of tutorials online but would still be happy to take a look at any notes you have to share. 

ADD REPLYlink modified 5.8 years ago • written 5.8 years ago by sviatoslav.kendall770

Do you know how to use Linux and bash? If so, then using a queuing system is a relatively small step. Usually if you can do:

echo "some long command" | bash

then it can run as:

echo "some long command" | qsub -e msg.err -o msg.out

and you simply have to learn about 5 common flags to reserve the correct number of CPUs and amount of memory.
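For a rough idea of what those flags look like on PBS/TORQUE, here is a minimal sketch that prints a job script with the resource requests written as #PBS directives. The helper name make_pbs_job and the values are made up, and the exact resource syntax (nodes=1:ppn=N, mem, walltime) is the TORQUE style, which your site may or may not accept as-is.

```shell
#!/usr/bin/env bash
# Print a TORQUE/PBS job script to stdout; nothing is submitted.
# make_pbs_job is a hypothetical helper; the values passed in are placeholders.
# Usage: make_pbs_job NAME CORES MEM WALLTIME COMMAND
make_pbs_job() {
    local name="$1" cores="$2" mem="$3" walltime="$4" cmd="$5"
    cat <<EOF
#!/bin/bash
#PBS -N ${name}
#PBS -l nodes=1:ppn=${cores}
#PBS -l mem=${mem}
#PBS -l walltime=${walltime}
#PBS -o ${name}.out
#PBS -e ${name}.err
cd "\$PBS_O_WORKDIR"
${cmd}
EOF
}

# Example: one node, 4 cores, 8 GB of memory, 2 hours of walltime.
make_pbs_job demo 4 8gb 02:00:00 'echo "some long command"'
```

Since qsub reads a script from standard input, the output can be piped straight into it, just like the echo example above.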

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.8 years ago by brentp23k

Maybe not a technical skill, but you should learn good practices and common courtesies. Always benchmark your programs/tasks for memory usage and CPU usage efficiency. Usually your goal is to either decrease the wallclock time needed to perform some task, or utilize multiple nodes to overcome some hardware limitation (e.g. memory). You should always look for ways to achieve these goals while utilizing your hardware as efficiently as possible.

A few pointers:

  1. Don't clog nodes or the queue with terrible scripts. Sometimes it can't be avoided, but if you do it all the time, people will go looking for a length of steel pipe when they see that you have 1200 24-core nodes each running a single-threaded Perl script that takes 13 hours, with 3k more of these filling the queue. It is sometimes worth making an initial investment in performance: you'll save walltime and people won't hate you.

  2. If your system is heterogeneous, with different groups of nodes that have different numbers of cores or amounts of memory, or if there are different interconnects, be mindful of what you use. Don't run 32 1GB-memory Python scripts on a 32-core node with 500GB of RAM. If you're not using MPI, avoid using nodes with InfiniBand/InfiniPath interconnects.

  3. Be mindful of the funding sources used to build the machine. Sometimes a department or the university/facility will pay for the whole cluster. In other cases the system is paid for in parts from a number of different labs/groups. In this case if you have access to the nodes, use them, but be mindful about who paid for what. When in doubt, ask if your jobs are causing problems.

  4. Always do small test runs before production. Try a single job on a node or two and make sure it is working. If you're having problems with production runs, move back to a small testing size. Don't sit there submitting huge numbers of jobs if you're troubleshooting or still developing.

  5. Each cluster is different: the hardware, software, admin/IT support, number of users and the common types of usage. It is always useful to remember which shortcuts you can and can't take because of the features specific to your system. The types and level of usage, and how your batch system schedules things, can affect how you can best run jobs. Sometimes it is faster to have a few nodes do more work than to have jobs sitting in the queue waiting for nodes to open up. The amount of hardware can also affect how you program: if you have tons of RAM you can be lazier about memory management and how you load data, but this can come back to hurt you later if you relax too much.

  6. Pay attention to how the file system is set up and how/what is backed up. What nodes can see what directories/file systems? Are there differences in the types/speed of the drives used? How often are directories backed up? What is the maximum size of the snapshots that can be taken? Are daily backups different sizes than monthly/weekly?

  7. This isn't usually the case in academic settings, but you may be paying per unit of usage: wall time, CPU time and/or storage.

  8. Efficiency, not speedup, is what you're after. Don't use twice as many cores if it only saves you a few hours of runtime. If your jobs are relatively fast, don't use huge numbers of nodes just to save time. Wait a bit longer. Even embarrassingly parallel problems can stop scaling once you hit up against other hardware limits.

In general you want to be more prudent than usual. HPCs are great and you can do huge amounts of stuff in parallel, but just remember that "stuff" can mean getting work done, or it can mean "creating a disaster". Though not permanent, it isn't fun to find out that your 1200 jobs run in parallel made 1200 messes.
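Point 8 can be made concrete with a little arithmetic. This is a minimal sketch, assuming you have timed the same task once on a single core and once on N cores; the function name parallel_efficiency and the timings are made up for illustration. Speedup is the single-core time divided by the N-core time, and efficiency is the speedup divided by N.

```shell
#!/usr/bin/env bash
# Compute speedup and parallel efficiency from two wall-clock timings.
# parallel_efficiency is a hypothetical helper; the timings below are invented.
# Usage: parallel_efficiency SERIAL_SECONDS PARALLEL_SECONDS CORES
parallel_efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
        # speedup = T1 / TN, efficiency = speedup / N
        printf "speedup=%.2f efficiency=%.2f\n", t1 / tn, t1 / (tn * n)
    }'
}

# One hour on 1 core versus 15 minutes on 8 cores.
parallel_efficiency 3600 900 8
# prints: speedup=4.00 efficiency=0.50
```

In this made-up case the job finishes four times faster, but half of the requested core-hours are wasted, which is the kind of trade-off worth checking before scaling up.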

In addition to learning the typical batch systems, you may want to explore the various tools and means of parallelization your cluster has to offer. These range from relatively simple tools like GNU parallel, to software-specific parallelization (e.g. MATLAB), to simple code-based approaches (e.g. R's snow package), to more complex ones (e.g. MPI). You may need to use these to develop custom tools, or you may need to know how they work for using software from others (e.g. MrBayes).
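As a taste of the simple end of that range, here is a minimal sketch of running several independent tasks at once on a single node. It uses xargs -P as a stand-in for GNU parallel, since xargs is more likely to be installed everywhere; the function name run_parallel and the sample names are made up, and the printf is a placeholder for real work.

```shell
#!/usr/bin/env bash
# Run one task per input item, at most MAX_JOBS at a time, on a single node.
# run_parallel is a hypothetical helper; printf stands in for a real command.
# Usage: run_parallel MAX_JOBS ITEM...
run_parallel() {
    local max_jobs="$1"
    shift
    # One item per line; xargs starts a new sh for each item, up to max_jobs at once.
    printf '%s\n' "$@" |
        xargs -P "$max_jobs" -I{} sh -c 'printf "processed %s\n" "$1"' _ {}
}

run_parallel 2 sampleA sampleB sampleC
```

The output lines can arrive in any order, which is a useful reminder that parallel tasks should not depend on each other's ordering.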

As others have stated, it isn't difficult to get started if you're already familiar with shell/*nix environments and can program. You may want to see if your school/company/entity has courses or classes on HPC. It can be very useful as the classes typically use whatever system you'll be working on. They'll also go more in depth into different areas of HPC/parallel computing which can be useful in the long run. I've found that they're pretty good places for networking and developing contacts you can approach with technical questions.

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.8 years ago by pld4.8k
5.8 years ago by
Dan D7.1k wrote:

I worked at Vanderbilt for several years, and their compute cluster team was fantastic. They did weekly workshops to teach new users how to properly submit jobs. These slides will hopefully be very helpful to you. They'll show you how to construct a job submission, query for existing jobs, and check resource availability. Skip to slide 16:

Based on the comments I think your cluster is using PBS/TORQUE, so hopefully those slides will be applicable. To check, just make a quick shell script job submission and see if it successfully executes after you qsub it.
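A quick test submission could look something like this. It is a minimal sketch: the temporary file name is arbitrary, and the script only calls qsub if it finds it on the PATH, so it is safe to try anywhere.

```shell
#!/usr/bin/env bash
# Write a trivial job script and submit it only if qsub is present.
# The file name template is arbitrary; the job just reports where it ran.
job_file="$(mktemp "${TMPDIR:-/tmp}/hello_job.XXXXXX")"
cat > "$job_file" <<'EOF'
#!/bin/bash
echo "Hello from $(uname -n)"
EOF

if command -v qsub >/dev/null 2>&1; then
    # On PBS/TORQUE this prints the ID of the newly queued job.
    qsub "$job_file"
else
    echo "qsub not found; job script left at $job_file"
fi
```

If the submission goes through, qstat should list the job while it is queued or running, and the job's output file should contain the name of the compute node it ran on.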

ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by Dan D7.1k
5.8 years ago by
Vivek2.4k wrote:

The most commonly used schedulers are LSF and Sun Grid Engine (SGE). If you search for LSF + tutorial or Sun Grid Engine + tutorial you'll find links to a bunch of quick-start guides on various university webpages.

ADD COMMENTlink written 5.8 years ago by Vivek2.4k
5.8 years ago by
United States
Ron1000 wrote:

I think this is a very good resource

ADD COMMENTlink written 5.8 years ago by Ron1000
5.8 years ago by
8732430 wrote:

SLURM is a widely used, highly modular and scalable resource manager for clusters (note the man sbatch suggestion in the comment above; sbatch is the command for submitting a script to a SLURM cluster). You can send an executable program to be run on the cluster.

ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by 8732430
5.8 years ago by
5heikki9.0k wrote:

If you know how to use a shell, you're already there. There are just a few more utilities that you need to master, like ssh, scp and qsub. If you don't, well, the shell is all the practice you need.
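Those three utilities usually get used in the same order: copy the script up, log in and submit, then copy the results back. The sketch below only prints the commands rather than running them, and everything in it is a placeholder: the function name show_workflow, the login me@cluster.example.edu, the script name and the remote directory are all made up.

```shell
#!/usr/bin/env bash
# Print (not run) the usual copy / log in and submit / fetch sequence.
# show_workflow and all names passed to it are hypothetical placeholders.
# Usage: show_workflow USER@HOST LOCAL_SCRIPT REMOTE_DIR
show_workflow() {
    local login="$1" script="$2" remote_dir="$3"
    local script_name
    script_name="$(basename "$script")"
    echo "scp $script $login:$remote_dir/"
    echo "ssh $login 'cd $remote_dir && qsub -e msg.err -o msg.out $script_name'"
    echo "scp $login:$remote_dir/msg.out ."
}

show_workflow me@cluster.example.edu ./align.sh /scratch/me
```

The last step only makes sense once the job has finished, so in practice you would wait for it to leave the queue before fetching msg.out.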

ADD COMMENTlink written 5.8 years ago by 5heikki9.0k
5.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

I first met SGE/OGE, and people told me to RTFM.

So many things to learn and to understand.

I then learned that there is a parallel implementation of GNU make for SGE (and Slurm).

Just run

qmake -j 10

instead of

make -j 10

no more problem.
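This works because a Makefile already states which steps depend on which, so make (or qmake) can tell what may run at the same time. Here is a minimal sketch that writes a toy Makefile into a temporary directory; the target names and recipes are made up, and the pattern rule assumes GNU make syntax.

```shell
#!/usr/bin/env bash
# Write a toy Makefile: two independent targets plus a step that merges them.
# Target names and recipes are hypothetical placeholders for real pipeline steps.
workdir="$(mktemp -d "${TMPDIR:-/tmp}/pipeline.XXXXXX")"
{
    printf 'all: merged.txt\n\n'
    # merged.txt needs both intermediate files before it can be built.
    printf 'merged.txt: a.out b.out\n\tcat a.out b.out > merged.txt\n\n'
    # a.out and b.out do not depend on each other, so they can run in parallel.
    printf '%%.out:\n\techo "result for $*" > $@\n'
} > "$workdir/Makefile"

cat "$workdir/Makefile"
```

Running make -j 2 in that directory builds a.out and b.out side by side and then merged.txt; on SGE, swapping make for qmake is what sends those steps to the cluster.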

see also: How To Organize A Pipeline Of Small Scripts Together? , Standard simple format to describe a bioinformatics analysis pipeline ...

ADD COMMENTlink modified 11 months ago by RamRS30k • written 5.8 years ago by Pierre Lindenbaum130k