How to use multiple computational nodes/cores for merging .fastq.gz files
8.6 years ago
ravi.uhdnis ▴ 220

Hi,

I want to merge multiple .fastq.gz files (forward/reverse) and am using the following command:

zcat dir1/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz dir2/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz dir3/ETH002281_ACAGTG_L003_R1_001.fastq.gz | gzip > dir4/ETH002281_ACAGTG_Lall_R1.gz

Although it runs fine, it takes a very long time because I can only run it on a single node. I have access to 15 nodes with 8 cores each and would like to run it on multiple nodes. It would be great to get an idea of how to merge multiple fastq.gz files across several computational nodes, so that the job finishes as early as possible using the full computational power of the nodes. Thanks.

Assembly next-gen-sequencing • 2.7k views
8.6 years ago
h.mon 35k

You may use pigz, a parallel gzip replacement, or simply use cat instead of zcat | gzip:

cat  dir1/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz dir2/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz dir3/ETH002281_ACAGTG_L003_R1_001.fastq.gz > dir4/ETH002281_ACAGTG_Lall_R1.gz

The compression ratio will not be as good as with zcat | gzip, but cat will be much faster.
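As a rough sketch of the pigz route, the pipeline from the question could be wrapped as shown here. The function name `merge_recompress` and the `THREADS` variable are hypothetical, and the default of 8 threads simply mirrors the 8 cores per node mentioned in the question. Note that pigz parallelizes compression across the cores of one machine, not across nodes, and the decompression side of the pipe remains essentially serial.

```shell
#!/usr/bin/env bash
# Sketch: merge gzipped FASTQ files and recompress with pigz when available.
# merge_recompress and THREADS are illustrative names, not from the thread.
set -euo pipefail

THREADS="${THREADS:-8}"   # assumption: 8 cores on the node running the job

merge_recompress() {
    # usage: merge_recompress OUT.gz IN1.gz [IN2.gz ...]
    local out="$1"
    shift
    if command -v pigz >/dev/null 2>&1; then
        # pigz -p sets the number of compression threads;
        # its output is ordinary gzip-compatible data.
        gzip -dc "$@" | pigz -p "$THREADS" > "$out"
    else
        # Fall back to single-threaded gzip if pigz is not installed.
        gzip -dc "$@" | gzip > "$out"
    fi
}
```

With the paths from the question, a call might look like `merge_recompress dir4/ETH002281_ACAGTG_Lall_R1.gz dir1/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz dir2/ETH002281_ACAGTG_L00*_R1_00*.fastq.gz dir3/ETH002281_ACAGTG_L003_R1_001.fastq.gz`.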


Thank you for the response. I'll try pigz in place of gzip in the zcat | gzip pipeline. True, the cat command is much faster (approx. 40x) than zcat | gzip, but I want to avoid it because it doesn't recompress the merged files; I expect the final merged files to differ in size by GBs.


You can concatenate gzipped files and the result is still a valid compressed gzipped file; I don't really see any reason to avoid that. The difference in compression would be negligible compared to recompressing it unless you have millions of tiny files.
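The claim above is easy to check on a toy example. The following sketch uses two made-up four-line FASTQ records in a temporary directory (all file names are illustrative), concatenates their gzipped forms with cat, and confirms that the result passes the gzip integrity test and decompresses to the same bytes as concatenating the uncompressed files.

```shell
#!/usr/bin/env bash
# Sketch: check that cat-ing gzip files yields data that decompresses
# to the concatenation of the originals. All file names are illustrative.
set -euo pipefail

workdir="$(mktemp -d)"

# Two tiny made-up FASTQ records.
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/a.fastq"
printf '@read2\nTTGA\n+\nIIII\n' > "$workdir/b.fastq"

# Compress each file separately, keeping the originals.
gzip -c "$workdir/a.fastq" > "$workdir/a.fastq.gz"
gzip -c "$workdir/b.fastq" > "$workdir/b.fastq.gz"

# Merge without recompressing.
cat "$workdir/a.fastq.gz" "$workdir/b.fastq.gz" > "$workdir/merged.fastq.gz"

# Integrity test: exits non-zero if the merged file is corrupt.
gzip -t "$workdir/merged.fastq.gz"

# Compare the decompressed merge against the plain concatenation.
cat "$workdir/a.fastq" "$workdir/b.fastq" > "$workdir/expected.fastq"
if gzip -dc "$workdir/merged.fastq.gz" | cmp -s - "$workdir/expected.fastq"; then
    echo "merged.fastq.gz decompresses to the concatenated reads"
fi
```

Because no data is decompressed or recompressed, the merge is bounded by disk throughput and not by CPU, which is consistent with the large speedup reported earlier in the thread.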


I agree that cat-ing gzip files is the best solution here. However, I vaguely remember that, strictly speaking, a gzip file produced by concatenating individual gzips is not "valid" since the footer of the concatenated files does not represent the whole file but only the last gzip file concatenated.
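The footer observation can be inspected directly. As far as I understand the gzip format (RFC 1952), a file is a series of members, each carrying its own trailer with a CRC-32 and an uncompressed-size field (ISIZE), so the last four bytes of a concatenated file describe only the final member. The sketch below, with made-up file contents and illustrative file names, reads that field and compares it with the total decompressed size.

```shell
#!/usr/bin/env bash
# Sketch: the trailer at the end of a concatenated gzip file records the
# size of the last member only, yet decompression still returns everything.
# File names and contents are made up for illustration.
set -euo pipefail

workdir="$(mktemp -d)"
printf 'AAAA\n' | gzip > "$workdir/a.gz"   # 5 uncompressed bytes
printf 'BBB\n'  | gzip > "$workdir/b.gz"   # 4 uncompressed bytes
cat "$workdir/a.gz" "$workdir/b.gz" > "$workdir/ab.gz"

# ISIZE is the final 4 bytes of the file, stored little-endian.
# Read them as single bytes so the result does not depend on host byte order.
last_isize="$(tail -c 4 "$workdir/ab.gz" | od -An -tu1 \
    | awk '{ print $1 + $2*256 + $3*65536 + $4*16777216 }')"

# Total number of bytes actually produced by decompression.
total_bytes="$(gzip -dc "$workdir/ab.gz" | wc -c | tr -d ' ')"

echo "ISIZE in final trailer: $last_isize"
echo "bytes after decompression: $total_bytes"
```

In practice this mostly matters for tools that trust the final trailer to report an uncompressed size; decompressors that handle multi-member files, such as gzip itself, still return all of the data.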

8.6 years ago

See if this post helps: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them. It covers GNU Parallel, with examples that use zcat.
