How do I count the real total reads from a fastq file?
Entering edit mode
14 months ago
giusdalt95 ▴ 10

Good morning everybody, I was doing some simple statistics on the 2 fastq file of a sample (forward and reverse). I simply wanted to obtain the number of the total reads for each of the two files, through this command:

zcat $SAMPLE.fwd.fq.gz | wc -l | awk '{print $1/4}'
zcat $SAMPLE.rev.fq.gz | wc -l | awk '{print $1/4}'

the sum of the two resulting numbers is 640000162. I thought that this should be the number of total reads, but when I run samtools flagstat on the resulting bam ($, here comes the output:

643938688 + 0 in total (QC-passed reads + QC-failed reads)
3938526 + 0 secondary
0 + 0 supplementary
129129835 + 0 duplicates
642067899 + 0 mapped (99.71% : N/A)
640000162 + 0 paired in sequencing
320000081 + 0 read1
320000081 + 0 read2
623066112 + 0 properly paired (97.35% : N/A)
637384172 + 0 with itself and mate mapped
745201 + 0 singletons (0.12% : N/A)
8707820 + 0 with mate mapped to a different chr
4103776 + 0 with mate mapped to a different chr (mapQ>=5)

So, here comes that the reads are actually 643938688 and not 640000162 (paired in sequencing) according to samtools. Now, I am a bit confused about what these values mean, if the total reads are 643938688 or 640000162 (excuse me for my naiveness). Could you give me some hint or some explanation about? Thank you, Giuseppe.

NGS Fastq End Paired Sequencing • 1.1k views
Entering edit mode
14 months ago

secondary alignments: (reads mapping at >1 position)

3938526 + 0 secondary

643938688 - 3938526 = 640000162

Entering edit mode
14 months ago
Sukjun Kim ▴ 90

The term 'reads' used in samtools' flagstat is more about 'the reported alignments' rather than the sequenced reads. You should not be confused with the number of reads in FASTQ file. It will be easier to understand when you think about it in this way, for example:

  • 640000162 is the number of 'reads' in FASTQ file
  • 643938688 is the number of 'alignments' in BAM file

'read' contains only sequence information, whereas 'alignment' contains information of the mapped position as well as sequence of the read.

Entering edit mode

Thanks for your comments Sukjun Kim. How is possible to have more alignments in a BAM file than reads in the FASTQ file?


Login before adding your answer.

Traffic: 2021 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6