Merging two fastq.gz files
6.0 years ago
tcf.hcdg ▴ 70

Hello,

I have 96 *fastq.gz raw read files from 24 samples. Each sample was sequenced on two lanes, giving one file per lane for each read of the pair (R1 and R2).

I would like to merge the reads from both lanes, separately for each read of the pair, into one output file named after the sample identifier in the input file name (e.g. 2271_merged_R1_001.fastq.gz).

File names are in this order:
22[71-94]*R[1-2]_001.fastq.gz;

**2271**_ID890_1_S1_L001_**R1_001.fastq.gz**
**2271**_ID890_1_S1_L002_**R1_001.fastq.gz**

**2271**_ID890_1_S1_L001_**R2_001.fastq.gz**
**2271**_ID890_1_S1_L002_**R2_001.fastq.gz**

I tried the following short script but only two output files are being generated (first and the last).

FOR R1 files

  for rf in 22[71-94]*R1_001.fastq.gz; do zcat $rf > 22"${71-94}"_merged_R1_001.fastq.gz ; done

FOR R2 files

for rf in 22[71-94]*R2_001.fastq.gz; do zcat $rf > 22"${71-94}"_merged_R2_001.fastq.gz ; done

My questions are:

1. Why are only two output files generated?
2. Why is the number of reads in the output files not the sum of the reads in the files from both lanes?
3. Is there a neat way to merge the reads from both lanes for both R1 and R2 in a single step, instead of running the loop once per read type?

What went wrong in the code, and how can I verify that the output files are completely merged?

Thanks

fastq merging • 6.0k views

For the 48 R1 files, the following code should work (back up your data and try it on one or two sets before running it on everything; compare MD5 sums):

$ for i in   *1_R1_001.fastq.gz; do zcat ${i%%01*}01_R1_001.fastq.gz ${i%%01*}02_R1_001.fastq.gz| gzip -c - > ${i%%_*}_"merged_R"${i#*_R*} ; done

Works for R2 as well. Output file names would be: 2271_merged_R1_001.fastq.gz for 2271 R1.
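
One way to check the result, sketched here under the assumption that every FASTQ record spans exactly four lines: compare the read count and a checksum of the decompressed merged file against the two lane files. The function names `count_reads` and `check_merge` are made up for this illustration. Note that the checksum is taken on the decompressed stream, since the compressed bytes of the merged file need not match anything.

```shell
# count_reads FILE...: number of FASTQ records in the given gzip files,
# assuming 4 lines per record (hypothetical helper).
count_reads() {
    lines=$(gzip -cd "$@" | wc -l)
    echo $(($lines / 4))
}

# check_merge MERGED LANE1 LANE2: succeed only if the merged file holds
# exactly the decompressed content of LANE1 followed by LANE2.
check_merge() {
    merged_sum=$(gzip -cd "$1" | cksum)
    lanes_sum=$(gzip -cd "$2" "$3" | cksum)
    if [ "$merged_sum" = "$lanes_sum" ]; then
        echo "OK $1: $(count_reads "$1") reads"
    else
        echo "MISMATCH $1: $(count_reads "$1") reads, lanes hold $(count_reads "$2" "$3")"
        return 1
    fi
}

# Usage:
# check_merge 2271_merged_R1_001.fastq.gz \
#     2271_ID890_1_S1_L001_R1_001.fastq.gz 2271_ID890_1_S1_L002_R1_001.fastq.gz
```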

6.0 years ago

There is no need to use gzcat; just use cat to merge any number of gzipped fastq files into a single one.
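
This works because a gzip file may consist of several compressed members back to back, and gzip decompresses them as one stream. A small self-contained sketch with made-up two-read data in a scratch directory (the file names are only for illustration):

```shell
# Work in a scratch directory so no real data is touched.
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$tmp/lane1.fastq.gz"
printf '@read2\nTGCA\n+\nIIII\n' | gzip -c > "$tmp/lane2.fastq.gz"

# Plain concatenation of the compressed files, no decompression step.
cat "$tmp/lane1.fastq.gz" "$tmp/lane2.fastq.gz" > "$tmp/merged.fastq.gz"

# gzip -t tests integrity; decompressing yields both records in order.
gzip -t "$tmp/merged.fastq.gz" && echo "merged archive is valid"
gzip -cd "$tmp/merged.fastq.gz"
```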

6.0 years ago
yhoogstrate ▴ 140

Is this what you're looking for maybe?:

for rf in 22[71-94]*R1_001.fastq.gz; do cat $rf >> 22"${71-94}"_merged_R1_001.fastq.gz ; done

zcat decompresses, which is unnecessary (and leaves uncompressed text in a file named .gz) since you are writing to a .gz file anyway. Also, >> appends while > overwrites, and appending seems to be what you need.
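
The difference between the two redirections can be seen on toy data. This is only a sketch with invented file names in a scratch directory; one caveat it illustrates is that with >> the target should be removed first, otherwise rerunning the loop appends the same reads again.

```shell
tmp=$(mktemp -d)
for lane in L001 L002; do
    printf '@%s_read\nACGT\n+\nIIII\n' "$lane" | gzip -c > "$tmp/$lane.fastq.gz"
done

# '>' truncates the target on every iteration: only the last lane survives.
for f in "$tmp"/L00?.fastq.gz; do
    cat "$f" > "$tmp/overwritten.fastq.gz"
done

# '>>' appends, so start from a clean slate or a rerun doubles the reads.
rm -f "$tmp/appended.fastq.gz"
for f in "$tmp"/L00?.fastq.gz; do
    cat "$f" >> "$tmp/appended.fastq.gz"
done

gzip -cd "$tmp/overwritten.fastq.gz" | grep -c '^@L00'   # prints 1
gzip -cd "$tmp/appended.fastq.gz" | grep -c '^@L00'      # prints 2
```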

I hope this helps you a bit.

Enjoy,

Youri


And what about question 1, "Why are only two output files generated?"
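
A likely explanation, based on how the shell parses the two constructs in the question: `${71-94}` is not a range but a parameter expansion meaning "positional parameter 71, or the default 94 if it is unset", and `[71-94]` is a set matching one single character. Every input is therefore written to the same 2294_merged file, one for the R1 loop and one for the R2 loop. The sketch below assumes the script is run without 71 or more arguments; the sample names are only test strings.

```shell
# With no 71st positional parameter set, ${71-94} falls back to "94".
echo "22${71-94}_merged_R1_001.fastq.gz"    # prints 2294_merged_R1_001.fastq.gz

# [71-94] is a one-character set: '7', the range '1' to '9', and '4'.
for name in 2271 2205 22A1; do
    case $name in
        22[71-94]*) echo "$name matches" ;;
        *)          echo "$name does not match" ;;
    esac
done
```

So the glob happens to select all 24 samples (their third character is 7, 8 or 9), but the output name never changes.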


I used the following and it worked:

R1

for ((num=71; num<=94; num++)); { cat 22"$num"*{L001,L002}_R1_001.fastq.gz > "22${num}_merged_R1_001.fastq.gz" ;}

R2

for ((num=71; num<=94; num++)); { cat 22"$num"*{L001,L002}_R2_001.fastq.gz > "22${num}_merged_R2_001.fastq.gz" ;}
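
Both read types can also be handled in one pass. The following is a sketch rather than a tested pipeline: it assumes the naming scheme from the question (sample ID before the first underscore, lanes L001 and L002), takes the sample IDs from the file names instead of hard-coding 71 to 94, and writes into a separate directory. The function name `merge_lanes` and the directory name `merged` are made up here.

```shell
# merge_lanes OUTDIR: run inside the directory holding the per-lane files.
# Assumed naming: <sample>_<anything>_L00<lane>_R<1|2>_001.fastq.gz
merge_lanes() {
    outdir=$1
    mkdir -p "$outdir"
    for first in *_L001_R1_001.fastq.gz; do
        [ -e "$first" ] || continue      # the glob matched nothing
        sample=${first%%_*}              # e.g. 2271
        for read in R1 R2; do
            # Globs expand in sorted order, so L001 precedes L002 for both
            # R1 and R2 and the read pairs stay in the same order.
            cat "${sample}"_*_L00?_"${read}"_001.fastq.gz \
                > "$outdir/${sample}_merged_${read}_001.fastq.gz"
        done
    done
}

# Usage: merge_lanes merged
```

Requiring `_L00?_` in the input pattern keeps previously merged files from being picked up as inputs on a rerun.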
6.0 years ago
igor 13k

If you are not sure what your code is doing, try checking what is actually happening. Instead of generating the final file blindly and hoping it is working properly, print the progress. For example, you can check which inputs are getting paired with which outputs:

for rf in 22[71-94]*R1_001.fastq.gz; do
  echo "$rf  to  22${71-94}_merged_R1_001.fastq.gz"
done