cat files with GNU Parallel
6.8 years ago
tayebwajb ▴ 100

I have several fastq files with the following naming convention:

SampleID_SampleNo_LaneNo_R1_001.fastq.gz

The SampleNo ranges from S0 to S87, while the LaneNo ranges from L001 to L004. An example name is 1025-00_S15_L002_R1_001.fastq.gz. I want to use GNU Parallel to concatenate all *R1_001.fastq.gz files for each sample into one file, irrespective of the LaneNo. Something like:

parallel --dryrun -v --progress "cat  {} > SampleID_SampleNo_R1_001.fastq.gz" :::  *L00{1..4}_R1_001.fastq.gz
gnu parallel awk cat • 5.1k views

FYI: If you want to map your fastq files with bwa or a similar aligner, believe me, you don't want to concatenate those files.

How about I map each lane's fastq separately, so that I get one BAM per lane, and then merge these BAMs to get a per-sample BAM? Then there is no need to concatenate.

This is usually what I do. You can go one step further: map all the file parts separately, then merge the alignments by lane and then by sample.

Yes, the 001 part is the first portion on the edge of the flow-cell and usually has the worst quality.

6.8 years ago

This is not the time to use parallelization. You want the files concatenated in a defined order, and in the same order every time. Parallelizing could put separate pieces in front of or behind each other, because asynchronous processing makes resource allocation nondeterministic.

Also, you won't get much speedup because 'cat' doesn't use much CPU. You're already bound by the speed of the disk drives, so accessing them from several jobs at once is a detriment.

You could fix both issues with a binary tree system: concatenate pairs and work up to the complete file. It's more effort, but worthwhile if you do this task repeatedly. Otherwise, let the operating system and RAID drivers handle that kind of parallelization. With regard to the userland application, a single "cat" will dump the data file as fast as possible.

6.8 years ago

I don't believe you need or want to use Parallel for this as you don't need the parallel execution. Try grep/xargs/cat:

ls *.fastq.gz | egrep '1025-00_S15_L00[1-4]_R1_001' | xargs cat >> 1025-00_S15_R1_001.fastq.gz
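To cover every sample rather than one hard-coded ID, the same idea can be wrapped in a plain loop. This is only a sketch under the naming scheme from the question; `merge_lanes` is a made-up helper name, and it assumes every sample has an L001 file to key on. Concatenated gzip files are themselves a valid gzip stream, so no decompression is needed.

```shell
# merge_lanes DIR: for every sample in DIR, concatenate its lane files
# (L001..L004) into <SampleID>_<SampleNo>_R1_001.fastq.gz.
# Hypothetical helper; assumes each sample has an L001 file.
merge_lanes() (
    dir="${1:-.}"
    for first in "$dir"/*_L001_R1_001.fastq.gz; do
        [ -e "$first" ] || continue    # the glob matched nothing
        prefix="${first%_L001_R1_001.fastq.gz}"
        # The glob expands in sorted order, so lanes are written
        # as L001, L002, L003, L004 on every run.
        # '>' overwrites, so rerunning does not duplicate reads.
        cat "$prefix"_L00[1-4]_R1_001.fastq.gz > "${prefix}_R1_001.fastq.gz"
    done
)
```

Call it as `merge_lanes /path/to/fastq` (or with no argument for the current directory). Samples with fewer than four lanes are handled as long as L001 exists.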

6.8 years ago
ole.tange ★ 4.0k

If you have GNU Parallel 20140822 this should work:

parallel eval 'cat {=s/L001_/L*_/=} > {=s/L001_//=}' ::: *_L001_R1_001.fastq.gz

karl.stamm mentions that you may be limited by disk speed. While that used to be absolutely true (and still is if you only have a single hard disk), it is no longer always true on RAID, network, and SSD drives. On a RAID device I saw a 600% speedup when I ran 10 jobs in parallel, and a smaller speedup when I ran either more or fewer jobs. In other words, the only way to know is to try it and measure.

Linux does a great job of caching disk access in RAM, so it could be fooling you and finishing the writes in the background. To get a 600% speedup, you would have to be writing to six drives at once, which would imply more than seven drives in the RAID array. If your network storage is really huge, this is quite possible. My servers only write to two drives at a time. If your SAN is in the petabyte range, of course it will have more R/W heads available, and probably another layer of RAM cache.

The RAID was a 40-disk RAID60, so a 600% speedup is not unreasonable. The data was so much bigger than RAM that caching would in practice have no effect. It still does not change the conclusion: what used to be always true about parallel disk I/O is sometimes no longer true, so measure instead of assuming.