Question: Understanding Samtools Flagstat Output
5.7 years ago by
geek_y9.7k wrote:

The following is the output of the samtools flagstat command on a paired-end BAM file generated after running Picard MarkDuplicates.

7417232 + 0 in total (QC-passed reads + QC-failed reads)
287618 + 0 duplicates
4534962 + 0 mapped (61.14%:-nan%)
7417232 + 0 paired in sequencing
3708616 + 0 read1
3708616 + 0 read2
4528278 + 0 properly paired (61.05%:-nan%)
4534962 + 0 with itself and mate mapped

I am having difficulty understanding whether the duplicates are counted as pairs or as single reads. If there are a total of 7417232 pairs and 287618 of those pairs are duplicates, that means 3% of the reads in my data are duplicates. Is my understanding correct?

modified 5.7 years ago by Devon Ryan90k • written 5.7 years ago by geek_y9.7k
5.7 years ago by
Devon Ryan90k (Freiburg, Germany) wrote:

7417232 is not the number of pairs but the total number of reads; there are 3708616 actual read pairs. It's easiest to figure out exactly what's going on by looking at the C code in bam_stat.c (it's a short file). The count in "287618 + 0 duplicates" is incremented every time a read is marked as a duplicate. So, if there are 100 duplicates of one read, that number increases by 100 (not 1). Practically speaking, that would mean there are 143809 duplicate pairs, which could all be copies of the same fragment (or not). Regardless, 287618 out of 7417232 reads is about 3.9% duplicates (closer to 4% than 3%), so that part of the interpretation is essentially right.
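The arithmetic above can be checked with a small script. This is only a sketch: the parser (`parse_flagstat`, a name made up for this example) assumes the old-style "N + M label" line layout shown in the question, and the "duplicate pairs" figure assumes both mates of every duplicate pair carry the flag.

```python
import re

# flagstat output copied from the question
FLAGSTAT = """\
7417232 + 0 in total (QC-passed reads + QC-failed reads)
287618 + 0 duplicates
4534962 + 0 mapped (61.14%:-nan%)
7417232 + 0 paired in sequencing
3708616 + 0 read1
3708616 + 0 read2
4528278 + 0 properly paired (61.05%:-nan%)
4534962 + 0 with itself and mate mapped
"""

# "<QC-passed> + <QC-failed> <label>"
LINE_RE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text):
    """Return {label: (qc_passed, qc_failed)} for each flagstat line."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        # Drop a trailing parenthetical such as "(61.14%:-nan%)".
        label = match.group(3).split(" (")[0]
        counts[label] = (int(match.group(1)), int(match.group(2)))
    return counts


counts = parse_flagstat(FLAGSTAT)
total_reads = counts["in total"][0]
dup_reads = counts["duplicates"][0]
pairs = counts["read1"][0]  # one read1 per pair
dup_pct = 100.0 * dup_reads / total_reads

print(f"read pairs: {pairs}")
print(f"duplicate reads: {dup_reads} ({dup_pct:.2f}% of all reads)")
print(f"duplicate pairs if both mates are flagged: {dup_reads // 2}")
```

Dividing by total reads or by pairs gives the same percentage as long as duplicates are counted in the same unit as the denominator.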

written 5.7 years ago by Devon Ryan90k

Just a quick note to clarify that flagstat reads the BAM file flags, so it reports how many reads have been flagged as duplicates in that particular file. You will find that this number varies from one duplicate-marking algorithm to another.
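To make the "it only reads the flags" point concrete, here is a simplified sketch of flag-based counting. It is not the code from bam_stat.c: the function name `tally_flags` and the example flag values are made up for illustration, and the QC-passed/QC-failed split that flagstat reports is left out. The bit values themselves are the ones defined in the SAM specification.

```python
# SAM FLAG bits, as defined in the SAM specification
PAIRED = 0x1
UNMAPPED = 0x4
READ1 = 0x40
READ2 = 0x80
DUPLICATE = 0x400


def tally_flags(flags):
    """Count reads per category by testing individual FLAG bits."""
    stats = {"total": 0, "duplicates": 0, "mapped": 0,
             "paired": 0, "read1": 0, "read2": 0}
    for flag in flags:
        stats["total"] += 1
        if flag & DUPLICATE:
            stats["duplicates"] += 1  # one increment per flagged read
        if not (flag & UNMAPPED):
            stats["mapped"] += 1
        if flag & PAIRED:
            stats["paired"] += 1
            if flag & READ1:
                stats["read1"] += 1
            if flag & READ2:
                stats["read2"] += 1
    return stats


# Hypothetical FLAG values for three read pairs:
# a mapped pair, a mapped pair flagged as duplicate, and an unmapped pair.
example_flags = [99, 147, 99 | DUPLICATE, 147 | DUPLICATE, 77, 141]
stats = tally_flags(example_flags)
print(stats)
```

Because the counter only tests bits, a file whose duplicates were never marked would report 0 duplicates no matter how many are really present.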

written 5.7 years ago by Jorge Amigo11k

Good clarification, my wording was a bit sloppy!

written 5.7 years ago by Devon Ryan90k

So when Picard marks a read as a duplicate in a paired-end BAM file, does it look at the start and end positions of both reads in the pair? For example, take two read pairs, Read1 and Read2. When marking duplicates, does Picard require that the start and end positions of Read1_r1 and Read2_r1 are the same and that the start and end positions of Read1_r2 and Read2_r2 are the same? When it finds a duplicate, will it flag both reads of the pair as duplicates? If so, do we lose both reads when removing duplicates?

modified 5.7 years ago • written 5.7 years ago by geek_y9.7k

I recall that it only looks at the start position, though I don't recall 100%. It should mark all but one of the duplicates. The unmarked read should have the best quality. At least that's my recollection.
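As a rough illustration of that recollection, the sketch below groups read pairs by the start positions of both mates, keeps the pair with the best quality score, and flags the rest. Everything here is an assumption for illustration (the `Pair` record, the `quality` score, the `mark_duplicates` function); it is not Picard's actual implementation, and real tools also take details such as strand into account.

```python
from collections import defaultdict
from typing import NamedTuple


class Pair(NamedTuple):
    """Hypothetical summary of one read pair."""
    name: str
    chrom: str
    start_r1: int
    start_r2: int
    quality: int  # assumed summary score, e.g. a sum of base qualities


def mark_duplicates(pairs):
    """Return the names of pairs to flag, keeping the best pair per group."""
    groups = defaultdict(list)
    for pair in pairs:
        # Pairs with the same start positions for both mates are grouped.
        groups[(pair.chrom, pair.start_r1, pair.start_r2)].append(pair)

    flagged = set()
    for group in groups.values():
        best = max(group, key=lambda p: p.quality)
        # Flag every pair in the group except the best one.
        flagged.update(p.name for p in group if p is not best)
    return flagged


pairs = [
    Pair("A", "chr1", 100, 350, 60),
    Pair("B", "chr1", 100, 350, 72),
    Pair("C", "chr1", 100, 350, 55),
    Pair("D", "chr1", 500, 740, 40),
]
flagged = mark_duplicates(pairs)
print(sorted(flagged))
```

Under these assumptions both mates of a flagged pair are marked, so removing duplicates drops both reads of pairs A and C while pair B survives as the representative of that group.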

written 5.7 years ago by Devon Ryan90k

Hi Devon, how can I get bam_stat.c?

written 9 months ago by canmcgill0

It's in the samtools source code:

written 9 months ago by Devon Ryan90k