We are currently downloading and analyzing multiple large WGS datasets (30-50x coverage) from patients. So far we have downloaded data for 20 patients from dbGaP/NCBI (tumor and matched normal for each), and more samples will be added. The download itself via prefetch/fasp was relatively fast and smooth, but now the problems begin, so maybe some of you have experience with how to optimize things.
-SRA to FASTQ conversion via fastq-dump is often unbearably slow. Not only is fastq-dump slow by itself, but I often run into I/O bottlenecks on our university cluster, which uses GPFS (not Lustre, as I stated yesterday); fastq-dump is frequently stuck in "D" state, i.e. uninterruptible sleep. To speed things up, I dumped large SRA files into several FASTQ chunks using the -N and -X options, but merging these chunks with GNU cat was also extremely slow, sometimes only a few hundred MB in several hours. Is that normal (the storage is not SSD-backed, as far as I know)?
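For reference, this is roughly how I split a run into spot-range chunks and merge them back afterwards (the accession, chunk size and total spot count below are just placeholders; the real spot count would have to be looked up for each run):

```bash
#!/bin/bash
# Split one SRA run into spot-range chunks dumped in parallel, then concatenate.
ACC=SRR1234567        # placeholder accession
CHUNK=50000000        # spots per chunk (placeholder)
TOTAL=200000000       # total spots in the run (placeholder)

for start in $(seq 1 "$CHUNK" "$TOTAL"); do
    end=$(( start + CHUNK - 1 ))
    dir=$(printf 'chunk_%012d' "$start")   # zero-padded so the glob stays in spot order
    mkdir -p "$dir"
    # -N/-X select the spot range; --split-files writes _1/_2 for paired-end data
    fastq-dump --split-files -N "$start" -X "$end" -O "$dir" "$ACC" &
done
wait

# Concatenate the chunks; both mates use the same chunk order, so pairing is preserved.
cat chunk_*/"${ACC}"_1.fastq > "${ACC}"_1.fastq
cat chunk_*/"${ACC}"_2.fastq > "${ACC}"_2.fastq
```

(Newer SRA Toolkit releases also ship the multithreaded fasterq-dump, which may make this manual chunking unnecessary, but I have not benchmarked it on GPFS.)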
-Same goes for alignment sorting. I tried using fewer threads with samtools sort but more memory per thread, to avoid creating too many temporary files that then have to be merged again. Still, even merging a small number (< 50) of temporary files takes many hours, again with only a few hundred MB written in several hours, which often collides with our walltime limits.
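This is roughly the sort command I use (thread count, memory per thread and the scratch path are placeholders; -m is per thread, so total memory is about -@ times -m):

```bash
# Sort with few threads but more memory per thread to reduce the number of
# temporary files; point -T at node-local scratch (if the cluster provides it)
# so the temp-file merge does not hit the shared GPFS filesystem.
samtools sort \
    -@ 4 \
    -m 8G \
    -T /local/scratch/$USER/sorttmp \
    -O bam \
    -o sample.sorted.bam \
    sample.bam
```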
It would be great if you could share your experience with handling such terabyte-scale data and what tricks one can apply to avoid these performance bottlenecks.
UPDATE: The main bottleneck seems to be reading the files from disk rather than writing them back out after processing.
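A crude way to check this (the file name is a placeholder; iostat comes from the sysstat package):

```bash
# Raw sequential read throughput from the shared filesystem; if this is already
# slow, the tools themselves are not the limiting factor.
dd if=sample_1.fastq of=/dev/null bs=1M count=10000 status=progress

# While a job runs, watch device utilization and whether processes wait on I/O.
iostat -x 5
```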