Node requirement on a cluster for bioinformatics analysis
7.8 years ago
halo22 ▴ 300

We are in the process of developing analysis pipelines for WGS, RNA-seq, and Methyl-seq data. The first project covers 100 patients with a common disease, and all analyses would be run on data from these patients. Our institute's computational center is offering us 5 nodes (24 CPU cores, 2.6 GHz clock speed, 128 GB RAM each), and I believe the necessary storage space would also be provided. Based on your experience, could you tell me whether 5 nodes will be sufficient? I understand that there cannot be one single answer to this question, but any help is much appreciated.

alignment genome next-gen • 1.7k views
7.8 years ago
GenoMax 141k

Since you appreciate that there can't be one right answer here, you're off to a good start.

Your requirements may change over time, so build some flexibility into the allocation so you can access nodes with better specs (mainly RAM) if you find you're unable to complete a specific analysis. This recommendation assumes no de novo assembly work is involved, since these specs (mainly memory) would be insufficient for that type of analysis.

7.8 years ago
pld 5.1k

The two big things are:

1) How much RAM does a single instance of the "largest" program you'll run need?

This is the only hard requirement. When I've run Trinity, I've needed 1 TB of RAM; if I hadn't had a machine with 1 TB of RAM, I couldn't have run Trinity at all. If the program with the highest memory consumption fits in 128 GB, then you're fine.

2) How long are you willing to wait for things to run?

More nodes mean you can run more things at once, but you don't strictly need them. Figure out which matters more to you, time or money. Spend more money and get more nodes for higher throughput; save money, get fewer nodes, and spend more time waiting for results.
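One way to answer the first question empirically is to measure the peak memory of your heaviest tool on a test input before committing to a node size. A minimal sketch using GNU time's `-v` flag, which reports "Maximum resident set size" (here `sort` stands in for a real pipeline step such as an aligner; the file paths are illustrative):

```shell
#!/usr/bin/env bash
# Generate a throwaway input file to exercise the tool.
seq 1 100000 > /tmp/reads.txt

# GNU time (-v) writes detailed resource stats to stderr;
# substitute your actual aligner/assembler command for "sort -R".
/usr/bin/time -v sort -R /tmp/reads.txt > /dev/null 2> /tmp/mem.log

# Peak RSS in kilobytes:
grep "Maximum resident set size" /tmp/mem.log
```

Run the real tool on a downsampled dataset first; peak memory usually scales with input size, so this gives a lower bound rather than an exact answer.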

7.8 years ago
Sinji ★ 3.2k

Technically one node would be sufficient to run these kinds of pipelines (admittedly I've never worked with WGS), as long as the programs/software you are using don't require more than 128 GB of RAM. You could process each sample one by one and get everything done, although it may be slow. Five nodes are useful if you're able to use them effectively.

Wait for a reply from someone with more experience using clusters; they should be able to help you with the grittier details of how to use the five nodes effectively. The short answer to your question is yes.


Task distribution using a scheduler such as SGE or LSF is probably sufficient to make use of the five nodes. In plain language: make 5 shell scripts, tell each one to use 24 cores, and submit them.
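The submission step above can be sketched as a loop over per-batch scripts. This is a hypothetical dry run for an SGE cluster: the batch names, script paths, and parallel environment name (`smp`) are illustrative and depend on your site's configuration; drop the `echo` to actually submit.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run sketch: one 24-core job per batch on an SGE cluster.
# "-pe smp 24" requests all 24 cores of a single node (PE names vary by site).
batches=(batch1 batch2 batch3 batch4 batch5)
for b in "${batches[@]}"; do
    echo qsub -pe smp 24 -N "align_${b}" "scripts/align_${b}.sh"
done > /tmp/submit_cmds.txt

cat /tmp/submit_cmds.txt
```

With 100 patients split five ways, each node would work through 20 samples sequentially while the scheduler keeps all five nodes busy.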

7.8 years ago
chen ★ 2.5k

It depends on how many days you have to complete your project.

But in my opinion, 128 GB of RAM is not enough :)

