Experiences with large datasets
6.8 years ago
ATpoint 81k

We are currently downloading and analyzing multiple large WGS datasets (30-50x) from patients. So far, we have downloaded the data of 20 patients from dbGaP/NCBI (tumor and matched normal for each), and more samples are planned to be included. The download itself via prefetch/fasp was relatively fast and smooth, but now the problems begin, so maybe you have some experience with how to optimize things.

-SRA to fastq conversion via fastq-dump is often unbearably slow. Not only is fastq-dump itself slow, I also often run into I/O bottlenecks on our university cluster, which uses GPFS (not Lustre as I stated yesterday). fastq-dump is often stuck in "D" state, i.e. uninterruptible sleep. To speed things up, I dumped large .sra files into several fastq chunks using the -N and -X options, but merging these chunks via GNU cat was also extremely slow, sometimes progressing only a few hundred MB in several hours. Is that normal? (The storage does not run on SSDs as far as I know.)
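
For reference, the chunked dump looked roughly like this (the accession and spot ranges are placeholders; the chunking only pays off if the chunks actually run in parallel):

    # Dump two spot ranges of the same .sra into separate directories, then
    # concatenate the per-mate chunks (concatenated gzip streams are valid gzip).
    fastq-dump --split-files --gzip -N 1         -X 100000000 -O chunk1 SRRXXXXXXX.sra &
    fastq-dump --split-files --gzip -N 100000001 -X 200000000 -O chunk2 SRRXXXXXXX.sra &
    wait
    cat chunk1/SRRXXXXXXX_1.fastq.gz chunk2/SRRXXXXXXX_1.fastq.gz > SRRXXXXXXX_1.fastq.gz
    cat chunk1/SRRXXXXXXX_2.fastq.gz chunk2/SRRXXXXXXX_2.fastq.gz > SRRXXXXXXX_2.fastq.gz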

-The same goes for sorting the alignments. I tried to use fewer threads with SAMtools sort but more memory per thread, to avoid creating too many tmp files that then need to be merged again. Still, even merging a few (< 50) files takes hours and hours, again at only a few hundred MB in several hours, which often collides with the walltimes.
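
The sort call was along these lines (thread count, memory per thread and tmp location are just examples):

    # Fewer sorting threads but more memory per thread means fewer tmp files to merge;
    # -T keeps the temporary files under a dedicated prefix instead of the working directory.
    samtools sort -@ 4 -m 8G -T /scratch/tmp/sample.sorttmp -o sample.sorted.bam sample.bam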

It would be great if you could share your experiences with handling terabyte-scale data like this, and what tricks one can apply to avoid these performance bottlenecks.

UPDATE: It seems that the main bottleneck is reading the files from disk, rather than writing them after being processed.

WGS Alignment Big Data Server Linux

Have you considered the possibility that your account may simply be getting throttled for having used too many resources in a given accounting period (if accounting is enabled on this cluster, which it likely is)? A cluster is, by design, meant for "fair share": once you go over a certain amount of resource usage (CPU hours etc. per accounting period), your jobs may receive the lowest priority.
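
If the scheduler happens to be SLURM (an assumption; PBS/LSF have their own equivalents), you can get a quick look at your fair-share standing and at the priority of your pending jobs with:

    sshare -u $USER    # fair-share usage and effective share for your account
    sprio -u $USER     # priority components (including fair-share) of your pending jobs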

That said, you should always (where possible) get the fastq files directly from EBI-ENA and avoid sratoolkit/sra files.
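
For example, roughly (placeholder accession; the ENA filereport endpoint returns the direct FTP links, so no .sra conversion is needed):

    # Ask ENA for the fastq FTP links of a run, then download them directly.
    wget -qO- 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRR1234567&result=read_run&fields=fastq_ftp'
    # then feed the returned ftp.sra.ebi.ac.uk/... links to wget/aria2c/ascp, e.g.
    wget 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/<subdirs>/SRR1234567_1.fastq.gz'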

"That said, you should always (where possible) get the fastq files directly from EBI-ENA and avoid sratoolkit/sra files." - Why?

To save time and spare your sanity.

True. Too bad my stuff is only available via dbGaP, so having fun with fastq-dump (PITA).

Are you able to provide details on the capacity of your university cluster (number of cores, RAM per core, etc.)? Also, have you ever completed a run for a single SRA file, just to get a feel for the expected run time? For any of the runs, can you get a printout of the maximum memory usage? And are you able to reserve nodes on the cluster, i.e. could other people be running large jobs at the same time as you and slowing you down?
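
For the memory question, a scheduler-agnostic option is to wrap the command in GNU time; if the cluster runs SLURM (an assumption), sacct can report it for finished jobs:

    /usr/bin/time -v fastq-dump --split-files SRRXXXXXXX.sra   # prints "Maximum resident set size"
    sacct -j <jobid> --format=JobID,MaxRSS,Elapsed,State       # SLURM accounting for a finished job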

Sure, I should have provided these details right away. About 3000 cores, typically 64-256 cores per node, with 128 to 256 GB of memory per node. We have a storage partition that uses Lustre and has a capacity of 180 TB. I monitored memory usage, but it was never an issue, especially in jobs where I only tried to cat about 10 fastq files into one big file of ~100 GB. Concerning run time, I sometimes (rarely) can align and sort an 80 GB fastq (PE, 2x100bp) in about 10 hours using 24 cores, but typically it takes days, especially the merging of the tmp files from the sorting.

UPDATE: the file system is GPFS, not Lustre as stated above.
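
For reference, the align-and-sort step is the usual aligner piped into samtools sort, roughly like this (bwa mem shown only as an example aligner, thread split illustrative; the pipe avoids writing an intermediate SAM to the slow filesystem):

    bwa mem -t 24 ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
        | samtools sort -@ 4 -m 4G -T /scratch/tmp/sample -o sample.sorted.bam -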

Are you running these jobs on /scratch or a similar folder that is mounted via the network?

On some clusters the computing nodes have their own scratch space that is physically located in the computing node; reading and writing to that space is faster by several orders of magnitude. For example, on RCC's FlashLite each node has an /nvme/ folder which is the preferred place to run things.
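
As a sketch, a job can stage its input to the node-local space, do the heavy I/O there and only copy the final result back (paths are illustrative; many schedulers export a node-local directory as $TMPDIR):

    cp /network/project/SRRXXXXXXX.sra "$TMPDIR"/                    # stage in to node-local disk
    fastq-dump --split-files --gzip -O "$TMPDIR" "$TMPDIR"/SRRXXXXXXX.sra
    cp "$TMPDIR"/SRRXXXXXXX_*.fastq.gz /network/project/fastq/       # stage the results back out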

There are many things that can limit performance in a cluster setting. You don't give enough details for us to help you. For example, what kind of filesystem are you using? If you're doing all this over NFS, forget about it. You need a modern, parallel filesystem.

Please see my comment above for details.

Assuming that the filesystem is somehow the bottleneck, is the striping adequate for the size of your files? See for example here. Check the I/O wait using top; if it's low, then it's probably not a filesystem access issue. Check also that other processes are not consuming resources. Finally, it could also be something else in the cluster, so you should also talk to your IT team.
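
Concretely, something along these lines (lfs is the Lustre client tool; the stripe count is only an example):

    lfs getstripe /path/to/wgs_data        # current stripe count/size for the directory
    lfs setstripe -c 8 /path/to/wgs_data   # new files in it get striped across 8 OSTs
    iostat -x 5                            # per-device utilisation, alongside the "wa" column in top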

Thanks for pointing me towards striping. As the lfs utility is not even installed on the cluster, that could indeed be an issue. I will talk to the admin tomorrow and then report back with the results.

This is generally something that the admin should be responsible for fixing. There's no reasonable explanation for the atrocious performance you're seeing.

I checked with the admin now and he said that we are using GPFS. I noticed during my testing that reading the files is the bottleneck rather than writing them. Any experience with GPFS?

Striping is still relevant for GPFS. It is known to deal badly with large numbers of files in the same directory and also, I seem to remember, with concurrent access to the same part of a file. This could be an issue if your programs are multithreaded. Also you could be saturating the network interface. For this kind of problem, you really need the help of the system administrators.

What is "large" in this context? 50 files, 500, 1000?

Depending on your set-up, that would be 1000.

I read several times on the web that GPFS performs poorly when applications use random access, which sratoolkit does. Reading from the file was the main bottleneck; writing is fairly OK after all.

It seems that the main bottleneck is reading the files from disk, rather than writing them after being processed

This sounds like a parameter-tuning issue for GPFS. Depending on how good/friendly your sysadmins are (and whether they like a good challenge), you could work with them on it. Being on a shared cluster, changes that would help you but may affect other users may not always be possible.

6.8 years ago
ATpoint 81k

The solution we came up with was the following: our file system is simply slow, and there was nothing that could really be done about it. The main bottleneck was reading from the file system rather than writing to it. Fortunately, some of the nodes have local SSDs, which I could use. So I loaded the SRAs via prefetch (ascp) onto the SSD, then fastq-dumped them from there, writing the output directly to /scratch. Thanks to ascp, the download of a 40-100 GB file was done in no time, and the dumping was sped up by, I think, a factor of 10 (never benchmarked it). Thanks very much for all your suggestions.
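
In outline, the workflow now looks like this (accession and paths are placeholders; the exact prefetch options for forcing ascp/fasp transport and setting the download location differ between sra-tools versions):

    # 1) prefetch the .sra onto the node-local SSD (ascp/fasp transport),
    #    with the download location pointed at the SSD (vdb-config or -O, depending on version)
    prefetch SRRXXXXXXX
    # 2) dump from the SSD, writing the fastq output directly to /scratch
    fastq-dump --split-files --gzip -O /scratch/project/fastq /local_ssd/SRRXXXXXXX.sra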

Our file system is simply slow, and there was nothing that could really be done about it.

Is that what the sysadmins told you? :-) A high-performance compute cluster with a slow file system... that does not seem like a good combination.

Out of curiosity, did they communicate with GPFS tech support, describing the issue, to see if something could be done?

I do not know, but since I am beyond caring how I get my data analyzed, all that matters to me now is that it is working ^^

since I am beyond caring

How long before the local SSD solution doesn't work anymore for you? If you're not the only one with the issue, other people will also want to use the local SSDs, and then everyone will compete for the same nodes.

Also I don't buy the 'GPFS is slow and there's nothing to do' argument. If that's true then there's a problem somewhere. Either the system is misconfigured or someone doesn't know what they're doing.

6.2 years ago
sutturka ▴ 190

Please check my experience with SRA download here. It might be useful.
