Question: Experiences with large datasets
ATpoint (21k), Germany, wrote 2.1 years ago:

We are currently downloading and analyzing multiple large WGS datasets (30-50x) from patients. So far, we downloaded the data of 20 patients from dbGaP/NCBI (tumor and matched normal respectively). More samples are planned to be included. The download itself via prefetch/fasp was relatively fast and smooth but now the problems begin, so maybe you have some experience in how to optimize things.

- SRA to FASTQ conversion via fastq-dump is often unbearably slow. Not only is fastq-dump itself slow, but I also often run into I/O bottlenecks on our university cluster, which uses gpfs (not lustre as I stated yesterday). fastq-dump is often stuck in "D" state, i.e. uninterruptible sleep. To speed things up, I dumped large SRA files into several fastq files using the -N and -X options, but merging these chunks via GNU cat was also extremely slow, sometimes progressing only a few hundred MB in several hours. Is that normal? (The server does not run on SSD as far as I know.)

- The same goes for alignment sorting. I tried using fewer threads with SAMtools sort but more memory per thread, to avoid creating too many tmp files that then need to be merged again. Still, even merging a few (< 50) files takes hours and hours, again with only a few hundred MB processed in several hours. That often runs into the walltime limit.
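
A minimal sketch of the chunked dump from the first point, assuming the total number of spots in the run is already known (for example from the run metadata). The helper names (spot_ranges, dump_command, merge_chunks) are hypothetical; -N, -X and -O are the fastq-dump options for the first spot id, the last spot id and the output directory. The merge step reads each chunk in large sequential blocks, which may or may not help on a given file system.

```python
import shutil


def spot_ranges(total_spots, n_chunks):
    """Split spots 1..total_spots into contiguous (first, last) ranges."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be >= 1")
    base, extra = divmod(total_spots, n_chunks)
    ranges, start = [], 1
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more chunks requested than spots available
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def dump_command(sra_path, first, last, outdir):
    """Build (not run) a fastq-dump call for one spot range."""
    return ["fastq-dump", "-N", str(first), "-X", str(last),
            "-O", str(outdir), str(sra_path)]


def merge_chunks(chunks, merged, buffer_bytes=64 * 1024 * 1024):
    """Concatenate chunk files in the given order using large block reads."""
    with open(merged, "wb") as out:
        for chunk in chunks:
            with open(chunk, "rb") as src:
                shutil.copyfileobj(src, out, length=buffer_bytes)
```

Each chunk should get its own output directory, because fastq-dump names its output after the accession and chunks written to the same directory would overwrite each other.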

It would be great if you could share your experiences in how to handle these Terabyte-scale data, and what tricks one can apply in order to avoid performance bottlenecks.

UPDATE: It seems that the main bottleneck is reading the files from disk, rather than writing them after being processed.
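
One way to put a number on the read side is to time a plain sequential read of an existing file. This is only a rough sketch: the function name and block size are arbitrary choices, and a file that was read or written recently may be served from the page cache, which would make the file system look faster than it is.

```python
import time


def read_throughput_mb_s(path, block_bytes=16 * 1024 * 1024, max_bytes=None):
    """Read a file sequentially and return (bytes_read, MB per second).

    If max_bytes is given, reading stops once at least that many bytes
    have been read (the last block may overshoot slightly).
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as fh:
        while max_bytes is None or total < max_bytes:
            block = fh.read(block_bytes)
            if not block:
                break  # end of file
            total += len(block)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, total / elapsed / 1e6
```

Running it on the same file from the shared file system and from a node-local disk gives a first impression of where the time goes.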

linux big data alignment wgs server • 1.1k views
modified 18 months ago by sutturka (150) • written 2.1 years ago by ATpoint (21k)

Have you considered the possibility that your account may simply be getting throttled after having used too many resources in a given accounting period (if accounting is enabled on this cluster, which it likely is)? A cluster is, by design, meant for "fair share". Once you go over a certain amount of resource usage (CPU hours etc. per accounting period), your jobs may receive the lowest priority.

That said you should always (where possible) get the fastq files directly from EBI-ENA and avoid sratoolkit/sra files.

modified 2.1 years ago • written 2.1 years ago by genomax (70k)

"That said you should always (where possible) get the fastq files directly from EBI-ENA and avoid sratoolkit/sra files." Why?

written 2.1 years ago by YaGalbi (1.4k)

To save time and spare your sanity.

written 2.1 years ago by Devon Ryan (91k)

True. Too bad my stuff is only available via dbGaP, so having fun with fastq-dump (PITA).

written 2.1 years ago by ATpoint (21k)

Are you able to provide details on the capacity of your university cluster (no. of cores, RAM per core, etc.)? Also, have you ever completed a run for a single SRA file, just to get a feel for the expected run time? On any of the runs, can you get a printout of the max memory usage? Are you able to reserve space on the cluster? Maybe other people are also running large jobs at the same time as you and slowing you down.

written 2.1 years ago by YaGalbi (1.4k)

Sure, I should have provided these details right away. About 3000 cores, typically 64-256 cores per node, with 128 to 256GB memory per node. We have a storage partition that uses lustre and has a capacity of 180TB. I monitored memory usage, but it was never an issue, especially on jobs where I only tried to cat about 10 fastq files into one big file of ~100GB. Concerning the run time, I can sometimes (rarely) align and sort an 80GB fastq (PE, 2x100bp) in about 10 hours using 24 cores, but typically it takes days, especially the merging of the tmp files from the sorting.

UPDATE: the file system is gpfs, not lustre.
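
For the sorting step, a sketch of how a fixed memory budget could be split across threads. samtools sort takes the thread count via -@, the approximate memory per thread via -m, and the prefix for its temporary files via -T, so the temporary files can be pointed at a different (ideally faster) disk than the input. The helper name, the paths and the 48 GB budget in the usage line are hypothetical.

```python
def sort_command(bam_in, bam_out, total_mem_gb, threads, tmp_prefix):
    """Build (not run) a samtools sort call that divides a memory budget by thread count."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    # -m is a per-thread value, so divide the total budget across threads.
    mem_per_thread_mb = (total_mem_gb * 1024) // threads
    return ["samtools", "sort",
            "-@", str(threads),
            "-m", f"{mem_per_thread_mb}M",
            "-T", str(tmp_prefix),
            "-o", str(bam_out),
            str(bam_in)]


# Hypothetical usage: 48 GB shared by 4 threads, temporary files on a local disk.
cmd = sort_command("sample.bam", "sample.sorted.bam", 48, 4, "/local/tmp/sample")
```

Fewer threads with a larger -m means larger in-memory chunks and therefore fewer temporary files to merge, at the cost of less parallel compression.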

modified 2.1 years ago • written 2.1 years ago by ATpoint (21k)

Are you running these jobs on /scratch or a similar folder that is mounted via the network?

On some clusters, the computing nodes have their own scratch space that is physically located in the computing node; reading and writing to that space is faster by several orders of magnitude. For example, on RCC's FlashLite each node has an /nvme/ folder, which is the preferred place to run things.
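
A small sketch of how a job script might choose such a node-local directory, assuming it is given a list of candidate paths to try. The function name and the candidate paths in the comment are hypothetical; which directories exist depends entirely on the cluster.

```python
import os
import shutil


def pick_scratch(candidates, needed_bytes):
    """Return the first writable directory with enough free space, else None."""
    for path in candidates:
        if not os.path.isdir(path) or not os.access(path, os.W_OK):
            continue  # missing or not writable on this node
        if shutil.disk_usage(path).free >= needed_bytes:
            return path
    return None


# Hypothetical usage: prefer local NVMe, fall back to /tmp, need about 200 GB.
# scratch = pick_scratch(["/nvme", "/tmp"], 200 * 10**9)
```

Checking free space up front avoids a job that fills the local disk halfway through a dump.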

written 2.1 years ago by Philipp Bayer (6.4k)

There are many things that can limit performance in a cluster setting. You don't give enough details for us to help you. For example, what kind of filesystem are you using? If you're doing all this over NFS, forget about it. You need a modern, parallel filesystem.

written 2.1 years ago by Jean-Karim Heriche (20k)

Please see my comment above for details.

written 2.1 years ago by ATpoint (21k)

Assuming that the filesystem is somehow the bottleneck, is the striping adequate for the size of your files? See for example here. Check the I/O wait using top. If it's low, then it's probably not a filesystem access issue. Also check that other processes are not consuming resources. Finally, it could also be something else in the cluster, so you should also talk to your IT team.
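
The I/O wait that top reports as "wa" comes from the aggregate cpu line in /proc/stat on Linux, so it can also be logged from inside a job. A minimal sketch, assuming two snapshots of /proc/stat taken some seconds apart; the function names are made up for illustration.

```python
def cpu_times(proc_stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of integer counters."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two /proc/stat snapshots."""
    a, b = cpu_times(before), cpu_times(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already counted in user/nice, so only the first eight are summed.
    total = sum(deltas[:8])
    return deltas[4] / total if total else 0.0


# Hypothetical usage on a compute node:
# import time
# first = open("/proc/stat").read(); time.sleep(5); second = open("/proc/stat").read()
# print(iowait_fraction(first, second))
```

A value that stays high while fastq-dump or samtools is running would point towards the file system rather than the CPU.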

written 2.1 years ago by Jean-Karim Heriche (20k)

Thanks for pointing me towards striping. As lfs is not even installed on the cluster, that could indeed be an issue. I will talk to the admin tomorrow and report back on the results.

written 2.1 years ago by ATpoint (21k)

This is generally something that the admin should be responsible for fixing. There's no reasonable explanation for the atrocious performance you're seeing.

written 2.1 years ago by Devon Ryan (91k)

I have now checked with the admin, and he said that we are using gpfs. I noticed during my tests that reading the files is the bottleneck rather than writing them. Any experiences with gpfs?

written 2.1 years ago by ATpoint (21k)

Striping is still relevant for GPFS. It is known to deal badly with large numbers of files in the same directory and also, I seem to remember, with concurrent access to the same part of a file. This could be an issue if your programs are multithreaded. Also you could be saturating the network interface. For this kind of problem, you really need the help of the system administrators.

written 2.1 years ago by Jean-Karim Heriche (20k)

What is "large" in this context? 50 files, 500, 1000?

written 2.1 years ago by ATpoint (21k)

Depending on your set-up, that would be 1000.

written 2.1 years ago by Jean-Karim Heriche (20k)

I read several times on the web that GPFS performs poorly when applications use random access, which sratoolkit does. Reading from file was the main bottleneck, writing is fairly ok after all.

written 2.1 years ago by ATpoint (21k)

"It seems that the main bottleneck is reading the files from disk, rather than writing them after being processed"

This sounds like a parameter tuning issue for GPFS. Depending on how good/friendly your sys admins are (and whether they like a good challenge), you could work with them. Since you are on a shared cluster, changes that may affect other users (but help you) may not always be possible.

written 2.1 years ago by genomax (70k)

ATpoint (21k), Germany, wrote 2.1 years ago:

The solution we came up with was the following: Our file system is simply slow, and there was nothing that could really be done about it. The main bottleneck was reading from the file system, rather than writing. Fortunately, some of the nodes had local SSDs, which I could use. So I loaded the SRAs via prefetch (ascp) onto the SSD, then ran fastq-dump on them from there, writing the output directly to /scratch. Thanks to ascp, the download of 40-100 GB files was done in no time, and the dumping was sped up by, I think, a factor of 10 (I never benchmarked it). Thanks very much for all your suggestions.
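
The workflow above, written out as a rough sketch. The function names and all paths are hypothetical, --split-files is an assumption that fits paired-end data, and the dbGaP authorization step is left out entirely. The location of the downloaded .sra file is passed in explicitly because where prefetch places it below the output directory can differ between toolkit versions.

```python
import subprocess


def staged_dump_commands(accession, ssd_dir, sra_path, fastq_dir):
    """Build the two steps: download to the local SSD, then dump to scratch."""
    prefetch = ["prefetch", "-O", str(ssd_dir), accession]
    dump = ["fastq-dump", "--split-files", "-O", str(fastq_dir), str(sra_path)]
    return [prefetch, dump]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
```

Keeping the random-access reads of fastq-dump on the local SSD and sending only the sequential FASTQ writes to the shared file system matches the observation that reading, not writing, was the slow part.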

written 2.1 years ago by ATpoint (21k)

"Our file system is simply slow, and there was nothing that could really be done about it."

Is that what the sys admins told you? :-) A high performance compute cluster with a slow file system... that does not seem like a good combination.

Out of curiosity, did they communicate with GPFS tech support and describe the issue to see if something could be done?

modified 2.1 years ago • written 2.1 years ago by genomax (70k)

I do not know, but since I am beyond caring how I get my data analyzed, all that matters to me now is that it is working^^

written 2.1 years ago by ATpoint (21k)

"since I am beyond caring"

How long before the local SSD solution doesn't work for you anymore? If you're not the only one with the issue, other people will also want to use the local SSDs, and then everyone will compete for the same nodes.

Also, I don't buy the 'GPFS is slow and there's nothing to do' argument. If that's true, then there's a problem somewhere. Either the system is misconfigured or someone doesn't know what they're doing.

written 2.1 years ago by Jean-Karim Heriche (20k)

sutturka (150), USA, wrote 18 months ago:

Please check my experience with SRA download here. It might be useful.

written 18 months ago by sutturka (150)