samtools flagstat calculation
1
0
Entering edit mode
4.6 years ago
Nitha ▴ 20

Hi all,

I have a doubt in samtools flagstat calculation, alignment quality check was done for the same samples but the alignment was done in different ways. Downloaded public available sample it was 80Gb fastq with Lane 1 and lane 2 which forward and reverse reads (paired end).

  1. merged reads lane 1 & lane 2 were taken for the bwa mem alignment with grch38, and various step dupemerge step was done and this output was taken for an alignment quality check by samtools flagstat.

  2. The second way, each raw fastq reads of both lane were taken and each read was aligned to grch38, after dupmerge step the reads were merged into one whole.bam file using samtools merge and the quality check done for the merged file.

But I observed the number of duplicates is decreased in the second way generated .bam file.

alignment results:

Reads are merged initial and used for further processing:

$samtools flagstat sample1_dup_merged.bam
603909814 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1723882 + 0 supplementary
**7197218 + 0 duplicates**
601809982 + 0 mapped **(99.65% : N/A)**
602185932 + 0 paired in sequencing
301084296 + 0 read1
301101636 + 0 read2
590992186 + 0 properly paired (98.14% : N/A)
597986268 + 0 with itself and mate mapped
2099832 + 0 singletons (0.35% : N/A)
3880858 + 0 with mate mapped to a different chr
2373672 + 0 with mate mapped to a different chr (mapQ>=5)

Each read of L1(40 reads) & L2 (39 reads) was as input for each processes:

$samtools flagstat sample1_Merged.bam
603910224 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1723659 + 0 supplementary
**208749 + 0 duplicates**
601810438 + 0 mapped **(99.65% : N/A)**
602186565 + 0 paired in sequencing
301084777 + 0 read1
301101788 + 0 read2
590992150 + 0 properly paired (98.14% : N/A)
597986993 + 0 with itself and mate mapped
2099786 + 0 singletons (0.35% : N/A)
3881347 + 0 with mate mapped to a different chr  
2373432 + 0 with mate mapped to a different chr (mapQ>=5)

Could anyone tell me how this samtools calculate duplicates and in this output why the number of the duplicate is different but the quality remains the same in both ways?

Thanks

samtools • 1.7k views
ADD COMMENT
1
Entering edit mode

Please show code to reproduce what you did. For code highlighting please use the formatting option 10101.

ADD REPLY
1
Entering edit mode

I don't understand your explanations: "various step dupemerge" could hide a multitude of sins! If you're using different pipelines and comparing the stats at the end then can see why flagstats would be different and I don't see why you'd assume samtools is the cause.

Flagstat is a very trivial algorithm and I doubt it's wrong. You could validate it trivially by just counting how many reads have each bit set (eg a perl or awk one liner could do this).

ADD REPLY
3
Entering edit mode
4.6 years ago

Calling duplicates on each lane will artificially decrease the number of called duplicates, which is exactly what you see. PCR duplicates are a library-level phenomenon, so they should be marked using a BAM file representing the entire library (i.e., all of the lanes).

A jkbonfield mentioned, it is almost inconceivable that samtools flagstat is wrong.

ADD COMMENT

Login before adding your answer.

Traffic: 1959 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6