Forgive the silly question but I'm having a problem with concatenation that is driving me a little mad.
I have a sequencing run with reads for each sample spread across multiple lanes. So I wanted to concatenate them before proceeding with mapping and further downstream analysis.
I looked up how to concatenate multiple fastq files on Biostars and found this great answer: "merge large amount of fastq files into a single one".
I proceeded to concatenate the multiple lanes using:
cat *fastq.gz > merged.fastq.gz
The problem is that when I count the number of reads in each individual file and add them all up, I get 31764073 reads, but when I cat the files together and count, I only get 15434478 reads. I tried typing the file names out one by one and got the same result as with the file globbing above.
I'm counting the number of reads using the command from "Sequence Number Count In Fastq.Gz File":
zcat my.fastq.gz | echo $((`wc -l`/4))
Can anyone help me understand what is happening? Am I losing some of these reads in the concatenation process?
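One way to narrow this down is to print the read count of every input file next to the running total, so the per-file numbers can be compared directly with the count for the merged file. The following is only a sketch: count_reads is a hypothetical helper name, it uses gzip -dc in place of zcat (the two behave the same on most Linux systems), and it assumes plain 4-line FASTQ records.

```shell
# Print the read count of each gzipped FASTQ file given as an argument,
# followed by the total over all of them.
# Assumption: every record is exactly 4 lines (no wrapped sequences).
count_reads() {
    local total f lines
    total=0
    for f in "$@"; do
        # tr strips the padding that some wc implementations add.
        lines=$(gzip -dc "$f" | wc -l | tr -d ' ')
        printf '%s\t%d\n' "$f" "$((lines / 4))"
        total=$((total + lines / 4))
    done
    printf 'total\t%d\n' "$total"
}
```

Running count_reads *fastq.gz before merging and count_reads merged.fastq.gz afterwards shows which inputs the glob actually picked up. Note that once merged.fastq.gz exists in the same directory, *fastq.gz matches it too, so it is safer to write the merged file somewhere else.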
There's likely a fastq file with a typo in the file name, such that it doesn't end in fastq.gz. As an aside, tell your sequencing provider that bcl2fastq has a --no-lane-splitting option that they could have used to obviate the need for you to merge the files.
Josh, *fastq.gz matches all FASTQ files. Are you sure you don't have paired-end reads that need to be concatenated separately into 2 different files? I'd check on downstream tool requirements before doing this cat.
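If the run is paired-end, the separate concatenation suggested above could look roughly like the sketch below. It assumes bcl2fastq-style file names of the form <sample>_L00<lane>_R<1|2>_001.fastq.gz, which may not match your files; merge_lanes and the output naming are hypothetical, so adjust the pattern to whatever your provider delivered.

```shell
# Merge all lanes of one sample, keeping R1 and R2 in separate files.
# Assumption: inputs are named <sample>_L00<lane>_R<1|2>_001.fastq.gz.
# $1 = sample prefix (may include a directory), $2 = output directory.
merge_lanes() {
    local sample outdir read
    sample=$1
    outdir=$2
    mkdir -p "$outdir"
    for read in R1 R2; do
        # Expand the lane glob into the positional parameters. The shell
        # sorts the matches, so R1 and R2 get the same lane order.
        set -- "${sample}"_L00?_"${read}"_001.fastq.gz
        # If nothing matched, the pattern is left unexpanded: skip it
        # (e.g. there is no R2 for single-end data).
        [ -e "$1" ] || continue
        cat "$@" > "$outdir/$(basename "$sample")_${read}.fastq.gz"
    done
}
```

For example, merge_lanes SampleA merged would write merged/SampleA_R1.fastq.gz and merged/SampleA_R2.fastq.gz. Writing into a separate directory keeps the merged files from being matched by a later *fastq.gz glob.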