I have several fastq files with the following naming convention:
SampleID_SampleNo_LaneNo_R1_001.fastq.gz. The SampleNo ranges from S0 to S87 while the LaneNo ranges from L001 to L004; an example name is 1025-00_S15_L002_R1_001.fastq.gz. I want to use GNU parallel to put all *R1_001.fastq.gz files for each sample into one file, irrespective of the LaneNo. Something like
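(a rough sketch of what I mean; the sed expression for stripping off the lane and the *.merged output name are only guesses at what I need):

    # derive one prefix per sample by stripping the lane part, then
    # concatenate every lane's R1 file for that prefix into one file
    ls *_R1_001.fastq.gz \
      | sed -E 's/_L00[1-4]_R1_001\.fastq\.gz$//' \
      | sort -u \
      | parallel 'cat {}_L00?_R1_001.fastq.gz > {}_R1_001.merged.fastq.gz'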
This is not the time to use parallelization. You want the files concatenated in a defined, consistent order; parallelizing could let separate pieces land in front of or behind one another, and asynchronous processing makes resource allocation nondeterministic.
Also, you won't get much speedup: 'cat' doesn't use much CPU, and you are already bound by the speed of the disk drives, so hitting them from several processes at once is actually a detriment.
You could fix both issues with a binary-tree scheme: concatenate pairs of files and work your way up to the complete file. It's more effort, and only worthwhile if you do this task repeatedly. Otherwise, let the operating system and RAID drivers handle that kind of parallelization; from the userland side, a single "cat" will dump the data as fast as possible.
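(For four lanes, the pairwise idea would look roughly like the sketch below, with made-up file names; again, probably not worth the effort for a one-off.)

    # merge lanes two at a time in parallel, then merge the two intermediates
    cat sample_L001.fastq.gz sample_L002.fastq.gz > sample_pairA.fastq.gz &
    cat sample_L003.fastq.gz sample_L004.fastq.gz > sample_pairB.fastq.gz &
    wait
    cat sample_pairA.fastq.gz sample_pairB.fastq.gz > sample_all.fastq.gz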
karl.stamm mentions that you may be limited by disk speed. While that used to be absolutely true (and still is if you only have a single hard disk), it is no longer always true for RAID devices, network drives and SSDs. On a RAID device I saw a 600% speedup when running 10 jobs in parallel, and a smaller speedup when running either more or fewer jobs in parallel. In other words, the only way to know is to try it and measure.
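Something along these lines (prefixes.txt, with one sample prefix per line, and the particular job counts are just placeholders):

    # time the same set of concatenations with different numbers of parallel jobs;
    # note that the page cache can flatter the later runs, so compare with care
    for j in 1 2 5 10 20; do
        echo "== $j jobs =="
        time parallel -j"$j" 'cat {}_L00?_R1_001.fastq.gz > {}_merged.fastq.gz' :::: prefixes.txt
    done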
Linux does a great job of caching disk access in RAM, so it could be fooling you and finishing the writes in the background. To get a 600% speedup you would have to be writing to six drives at once, which implies more than seven drives in the RAID array. If your network storage is really huge, that is quite possible; my servers will only write to two drives at a time. If your SAN is in the petabyte range, of course it will have more R/W heads available, and probably another layer of RAM cache.
The RAID was a 40-disk RAID60, so a 600% speedup is not unreasonable. The data was so much bigger than RAM that caching would in practice have no effect. It still does not change the conclusion: what used to be always true about parallel disk I/O sometimes no longer is, so measure instead of assuming.
FYI: if you want to map your fastq files with bwa or whatever, believe me, you don't want to concatenate those files.
How about I map each lane's fastq separately, so that I get one bam per lane, and then merge those bams to get a per-sample bam? Then there is no need to concatenate.
This is usually what I do, and you can go one step further and map all the file parts separately, then merge the alignments first by lane and then by sample.
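For example, something along these lines with bwa-mem and samtools (the reference, read-group and file names here are only illustrative):

    # map each lane separately, tagging the read group, then merge the per-lane BAMs
    for lane in L001 L002 L003 L004; do
        bwa mem -R "@RG\tID:1025-00.$lane\tSM:1025-00" ref.fa \
            1025-00_S15_${lane}_R1_001.fastq.gz 1025-00_S15_${lane}_R2_001.fastq.gz \
            | samtools sort -o 1025-00.$lane.bam -
    done
    samtools merge 1025-00.merged.bam 1025-00.L00*.bam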
Yes, the 001 part is the first portion on the edge of the flow-cell and usually has the worst quality.