Question

Understanding Samtools Flagstat Output

8

Entering edit mode

11.7 years ago

GouthamAtla 12k

The following is the output of samtools flagstat command on bam file (paired-end) generated after markDuplicate of Picards.

7417232 + 0 in total (QC-passed reads + QC-failed reads)
287618 + 0 duplicates
4534962 + 0 mapped (61.14%:-nan%)
7417232 + 0 paired in sequencing
3708616 + 0 read1
3708616 + 0 read2
4528278 + 0 properly paired (61.05%:-nan%)
4534962 + 0 with itself and mate mapped

I am having difficulty in understanding whether the duplicates are pairs or single. If there are total of 7417232 pairs and out of them 287618 pairs are duplicates means, there are 3% of duplicate reads in my data. Is my understanding is correct ?

samtools picard markduplicates illumina paired-end • 43k views

ADD COMMENT • link updated 5.7 years ago by Nitha ▴ 20 • written 11.7 years ago by GouthamAtla 12k

0

Entering edit mode

5.7 years ago

Nitha ▴ 20

@Devon Ryan, I had used two bam files of same samples (bam file obtained by two different ways 1. By using joined fastq files as input and 2. By using each fastq files for each steps) for alignment quality check using samtools falg stat. But i observed difference in duplicate numbers.

The .bam files generated during dupmerge step were used to for quality check

Fastqfiles are merged initial and used for further processing
samtools flagstat Sample1-A_dup_merged.bam
603909814 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1723882 + 0 supplementary
7197218 + 0 duplicates
601809982 + 0 mapped (99.65% : N/A)
602185932 + 0 paired in sequencing
301084296 + 0 read1
301101636 + 0 read2
590992186 + 0 properly paired (98.14% : N/A)
597986268 + 0 with itself and mate mapped
2099832 + 0 singletons (0.35% : N/A)
3880858 + 0 with mate mapped to a different chr
2373672 + 0 with mate mapped to a different chr (mapQ>=5)

********************
Each fastq files of L1(40 fastq files) & L2 (39 fastqfiles) were as input for each processes
samtools flagstat Sample1-A_Merged.bam
603910224 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1723659 + 0 supplementary
208749 + 0 duplicates
601810438 + 0 mapped (99.65% : N/A)
602186565 + 0 paired in sequencing
301084777 + 0 read1
301101788 + 0 read2
590992150 + 0 properly paired (98.14% : N/A)
597986993 + 0 with itself and mate mapped
2099786 + 0 singletons (0.35% : N/A)
3881347 + 0 with mate mapped to a different chr
2373432 + 0 with mate mapped to a different chr (mapQ>=5)

Could you please help me to understand why this difference occurs?
Thanks!

ADD COMMENT • link 5.7 years ago by Nitha ▴ 20

0

Entering edit mode

This should be a new question, though the answer is that if you call duplicates on the smaller file you will necessarily find fewer duplicates (just think about how calling duplicates works).

ADD REPLY • link 5.7 years ago by Devon Ryan 105k

0

Entering edit mode

Thankyou@DevonRyan. Whether I have to delete this post and re-post as a new question?

ADD REPLY • link 5.7 years ago by Nitha ▴ 20

0

Entering edit mode

Given that the second sentence in my comment was the answer you might as well just leave things as they are.

ADD REPLY • link 5.7 years ago by Devon Ryan 105k

score 13 · Accepted Answer · 2013-10-24

13

Entering edit mode

11.7 years ago

Devon Ryan 105k

There aren't 7417232 pairs, but total reads. There are 3708616 actual pairs of reads. It's easiest to figure out exactly what's going on by just looking at the C code in bam_stat.c (it's a short file). The 287618 + 0 duplicates numbers are incremented every time a read is marked as a duplicate. So, if there are 100 duplicates for one read then that number will increase by 100 (not 1). Practically speaking, that would mean that there are 143809 duplicate pairs, which could all be the same duplicate (or not). Regardless, that's about 3% (closer to 4% I think) of duplicates, so that part of the interpretation is spot on.

ADD COMMENT • link 11.7 years ago by Devon Ryan 105k

1

Entering edit mode

just a quick note to clarify that flagstat reads the bam file flags, so it reports how many reads have been flagged as duplicated in that particular file. you will find that this number varies from one to other algorithm.

ADD REPLY • link 11.7 years ago by Jorge Amigo 14k

0

Entering edit mode

Good clarification, my wording was a bit sloppy!

ADD REPLY • link 11.7 years ago by Devon Ryan 105k

0

Entering edit mode

So when picards marks a read as duplicate in paired end bam file, It looks for start and end position of both the read pairs ? For example, two reads reads are there Read1 and Read2. So when marking duplicate, picards will look for ( Start and end position of Read1_r1 and Read2_r1 should be same && start and end position of Read1_r2 and Read2_r2 should be same ) ?? When it finds a duplicate read...will it flag both the reads as duplicates ? if it is so...while removing duplicate reads...we loose both the reads ?

ADD REPLY • link 11.7 years ago by GouthamAtla 12k

2

Entering edit mode

I recall that it only looks at the start position, though I don't recall 100%. It should mark all but one of the duplicates. The unmarked read should have the best quality. At least that's my recollection.

ADD REPLY • link 11.7 years ago by Devon Ryan 105k

0

Entering edit mode

Hi Devon, how can i get bam_stat.c ?

ADD REPLY • link 6.8 years ago by canmcgill • 0

0

Entering edit mode

It's in the samtools source code: https://github.com/samtools/samtools/blob/develop/bam_stat.c

ADD REPLY • link 6.8 years ago by Devon Ryan 105k