Question: Concatenate fastq.gz - fewer reads after concatenation than before?
josh.cutts1 wrote (2.0 years ago):

Forgive the silly question but I'm having a problem with concatenation that is driving me a little mad.

I have a sequencing run with reads for each sample spread across multiple lanes. So I wanted to concatenate them before proceeding with mapping and further downstream analysis.

I looked up how to concatenate multiple FASTQ files on Biostars and found this great answer: "merge large amount of fastq files into a single one"

I proceeded to concatenate the multiple lanes using:

cat *fastq.gz > merged.fastq.gz

The problem is that when I count the number of reads in each individual file and add them up, I get 31764073 reads; however, when I cat the files together and count, I only get 15434478 reads. I tried typing the file names out one by one and got the same result as with the file globbing above.

I'm counting the number of reads using the command from "Sequence Number Count In Fastq.Gz File":

zcat my.fastq.gz | echo $((`wc -l`/4))
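The comparison described above (the sum of the per-file counts versus the count in the merged file) can be sketched as follows. This is only a minimal sketch, assuming every FASTQ record spans exactly four lines; count_reads and sum_reads are hypothetical helper names, not part of any existing tool.

```shell
# Count reads in one gzipped FASTQ file.
# Assumption: every record is exactly 4 lines (no wrapped sequences).
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Sum the read counts of all files given as arguments.
# Per-file counts go to stderr, the total goes to stdout.
sum_reads() {
    total=0
    for f in "$@"; do
        n=$(count_reads "$f")
        printf '%s\t%s\n' "$f" "$n" >&2
        total=$(( total + n ))
    done
    echo "$total"
}
```

For example, `sum_reads lane1.fastq.gz lane2.fastq.gz` (placeholder file names) can be compared with `count_reads merged.fastq.gz`. Note that a glob such as *fastq.gz also matches merged.fastq.gz if it sits in the same directory, so it should be left out of the per-file sum.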

Can anyone help me understand what is happening? Am I losing some of these reads in the concatenation process?

Tags: sequencing, chip-seq, next-gen

There's likely a fastq file with a typo in the file name, such that it doesn't end in fastq.gz.

As an aside, tell your sequencing provider that bcl2fastq has a --no-lane-splitting option that they could have used to obviate the need for you to merge the files.
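One way to look for such a misnamed file is sketched below. It lists names in a directory that look like sequence files but would not be matched by *fastq.gz. The patterns used to flag a "suspicious" name are my own guesses, and list_missed is a hypothetical helper name.

```shell
# Print file names in directory $1 that look like sequence files
# but would be skipped by the *fastq.gz glob.
list_missed() {
    for f in "$1"/*; do
        [ -e "$f" ] || continue      # directory is empty
        name=${f##*/}                # strip the directory part
        case "$name" in
            *fastq.gz) ;;                        # picked up by the glob
            *.gz|*fastq*|*.fq*) echo "$name" ;;  # assumed suspicious patterns
        esac
    done
}
```

Running `list_missed .` in the directory holding the lane files should print nothing if every file name is spelled as expected.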

written 2.0 years ago by Devon Ryan

Josh, *fastq.gz matches all FASTQ files. Are you sure you don't have paired-end reads that need to be concatenated separately into two different files? I'd check the downstream tool requirements before doing this cat.
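If the data are paired-end, concatenating each mate separately could look roughly like this. It is a sketch that assumes bcl2fastq-style file names of the form SAMPLE_L001_R1_001.fastq.gz; merge_lanes and the output naming are hypothetical, so adjust the pattern to whatever your provider delivered.

```shell
# Concatenate the lane files of one sample, keeping R1 and R2 apart.
# Usage: merge_lanes INDIR SAMPLE OUTDIR
# Assumption: names look like SAMPLE_L001_R1_001.fastq.gz.
merge_lanes() {
    indir=$1
    sample=$2
    outdir=$3
    for mate in R1 R2; do
        # Globs expand in sorted order, so lanes are appended in the
        # same order for both mates and the pairs stay in sync.
        set -- "$indir/${sample}"_L*_"${mate}"_*.fastq.gz
        [ -e "$1" ] || continue   # no files for this mate (e.g. single-end run)
        cat "$@" > "$outdir/${sample}_${mate}.merged.fastq.gz"
    done
}
```

For example, `merge_lanes raw S1 merged` would write merged/S1_R1.merged.fastq.gz and, if R2 files exist, merged/S1_R2.merged.fastq.gz. Writing the output to a separate directory keeps the merged files from being matched by the input glob on a rerun.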

modified 2.0 years ago • written 2.0 years ago by RamRS
josh.cutts1 wrote (2.0 years ago):

Arg sorry! I made a mistake that I can't reproduce.

I tried to recreate the problem and it has gone away; everything adds up correctly. I just have fewer reads than our sequencing provider said we would, but the counts add up in the individual files, so I need to follow up with them.

Thanks for your help Devon and Ram. Good to know that there is a no lane splitting option for the future!


Good to know this, Josh. I'm moving your post to an accepted answer to provide this thread with closure. For people who visit this post in the future, the takeaway is: try to reproduce the problem :-)

written 2.0 years ago by RamRS