Confused By The Flagstat Output After Removing PCR Duplicates
12.1 years ago
KJ Lim ▴ 140

Good day.

I ran into the following situation.

Here is the flagstat output for the paired-end mapped reads before PCR duplicates were removed:

:::::::::::::: 
0H.flagstat.txt
::::::::::::::
173146136 + 0 in total 
0 + 0 duplicates
130510023 + 0 mapped (75.38%:nan%)
173146136 + 0 paired in sequencing
86573068 + 0 read1  <--
86573068 + 0 read2  <--
87873910 + 0 properly paired (50.75%:nan%)
87873910 + 0 with itself and mate mapped
42636113 + 0 singletons (24.62%:nan%)

Here is the flagstat output for the paired-end mapped reads after PCR duplicates were removed with the Picard MarkDuplicates tool:

::::::::::::::
0H.ptFlagstat.txt
::::::::::::::
49080460 + 0 in total 
0 + 0 duplicates
6444347 + 0 mapped (13.13%:nan%)
49080460 + 0 paired in sequencing
45547041 + 0 read1  <--
3533419 + 0 read2   <--
5822436 + 0 properly paired (11.86%:nan%)
5822436 + 0 with itself and mate mapped
621911 + 0 singletons (1.27%:nan%)

The read1 and read2 counts differ after the PCR duplicates were removed. Has anyone here seen the same thing?

I'm also confused by the phrases "paired in sequencing" and "properly paired"; the numbers reported for the two are different. Could anyone kindly share their thoughts?

picard pcr duplicates sam bam • 3.3k views
12.0 years ago

Your results do look a bit strange. As far as I know, flagstat counts "read1" and "read2" from the SAM flag bits (0x40 and 0x80) over all paired reads, whether or not they are mapped, so their sum should equal the "paired in sequencing" value rather than the "mapped" value, and that is what your output shows. What is unusual is the second report: in an untouched paired-end file the read1 and read2 counts are the same, as in your first report, and they only diverge when a tool removes individual reads instead of whole pairs.
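The sums mentioned here can be checked directly against the numbers posted in the question. The following is only a small arithmetic sketch: the counts are copied from the two flagstat reports above, while the dictionary keys and the `check` helper are names made up for illustration.

```python
# Counts copied from the two flagstat reports in the question.
# The key names are illustrative, not samtools terminology.
before = {
    "total": 173146136, "mapped": 130510023, "paired": 173146136,
    "read1": 86573068, "read2": 86573068,
    "both_mapped": 87873910, "singletons": 42636113,
}
after = {
    "total": 49080460, "mapped": 6444347, "paired": 49080460,
    "read1": 45547041, "read2": 3533419,
    "both_mapped": 5822436, "singletons": 621911,
}


def check(stats):
    """Report which sums hold for one flagstat report (hypothetical helper)."""
    r12 = stats["read1"] + stats["read2"]
    return {
        "read1+read2 == paired": r12 == stats["paired"],
        "read1+read2 == mapped": r12 == stats["mapped"],
        "both_mapped+singletons == mapped":
            stats["both_mapped"] + stats["singletons"] == stats["mapped"],
        "pct_mapped": round(100 * stats["mapped"] / stats["total"], 2),
    }


for name, stats in (("before", before), ("after", after)):
    print(name, check(stats))
```

In both reports read1 + read2 matches "paired in sequencing", and "with itself and mate mapped" plus "singletons" matches "mapped".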

"Paired in sequencing" is the number of paired reads among the total reads (usually equal to this number, although you could in principle have a mix of paired-end and single-end reads in a BAM/SAM file). "Properly paired" is the number of alignments where the "properly paired" SAM flag is set. This is done by the aligner, so it depends on the aligner how that is defined. Generally, it means that read 1 and read 2 align within some maximum distance of each other and in the correct orientation (if applicable).

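To make the two phrases concrete, here is a minimal sketch of how the relevant SAM flag bits can be decoded. The bit values come from the SAM specification; the `describe` function and the example flag values are only illustrative, and the exact conditions flagstat applies may differ slightly from this simplified view.

```python
# SAM flag bits, as defined in the SAM specification.
PAIRED = 0x1         # read is paired in sequencing
PROPER_PAIR = 0x2    # aligner considers the pair properly aligned
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # its mate is unmapped
READ1 = 0x40         # first read of the pair
READ2 = 0x80         # second read of the pair


def describe(flag):
    """Decode the bits that matter for this question (hypothetical helper)."""
    return {
        "paired_in_sequencing": bool(flag & PAIRED),
        "properly_paired": bool(flag & PROPER_PAIR),
        "mapped": not flag & UNMAPPED,
        "mate_mapped": not flag & MATE_UNMAPPED,
        "read1": bool(flag & READ1),
        "read2": bool(flag & READ2),
    }


# 99  = 1 + 2 + 32 + 64: paired, properly paired, mate on reverse strand, read1
# 73  = 1 + 8 + 64:      paired, mapped, mate unmapped (a "singleton"), read1
# 141 = 1 + 4 + 8 + 128: paired, unmapped, mate unmapped, read2
for flag in (99, 73, 141):
    print(flag, describe(flag))
```

All three example reads count towards "paired in sequencing", but only the first one has the "properly paired" bit set, which is why the two numbers in a flagstat report usually differ.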

Thanks Mikael for the explanation.

I mapped the SOLiD csfasta reads against a pseudogenome (a collection of EST sequences from the genus), as no complete genome is available for this non-model plant species. I ran the mapping with SHRiMP2 with the --half-paired option on (it is on by default as of v2.2.0).

12.0 years ago

I'm not all that clear on what MarkDuplicates does with reads and read pairs where one or both ends don't map.

Maybe read 2 mapped much better than read 1, so MarkDuplicates removed far more of it, and your read 1 data is full of unmapped reads that MarkDuplicates left alone.

You can use samtools view to dissect how many read 1s and read 2s are properly paired versus just mapped versus unmapped.
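One way to do this breakdown, sketched here in Python under a few assumptions, is to read the text records that `samtools view` prints and tally each read by mate (read 1 or read 2) and by status. The `classify` and `tally` functions and the demo records are made up for illustration; only the flag bits come from the SAM specification.

```python
from collections import Counter


def classify(flag):
    """Return (mate, status) for one SAM flag value (hypothetical helper)."""
    if flag & 0x40:
        mate = "read1"
    elif flag & 0x80:
        mate = "read2"
    else:
        mate = "other"
    if flag & 0x4:
        status = "unmapped"
    elif flag & 0x2:
        status = "properly paired"
    else:
        status = "mapped, not properly paired"
    return mate, status


def tally(sam_lines):
    """Count reads per (mate, status) from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):       # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:            # skip blank or malformed lines
            continue
        flag = int(fields[1])          # FLAG is the second SAM column
        if flag & 0x900:               # skip secondary/supplementary records
            continue
        counts[classify(flag)] += 1
    return counts


# Made-up records for demonstration only; real input would come from
# `samtools view` on your own BAM file.
demo = [
    "@HD\tVN:1.0",
    "r1\t99\tc1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r1\t147\tc1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII",
    "r2\t73\tc1\t300\t60\t4M\t=\t300\t0\tACGT\tIIII",
    "r2\t133\tc1\t300\t0\t*\t=\t300\t0\tACGT\tIIII",
]
for (mate, status), n in sorted(tally(demo).items()):
    print(mate, status, n, sep="\t")
```

To run this on real data, the same `tally` function could be fed `sys.stdin` with the output of `samtools view` piped in. Individual categories can also be counted without any script, for example `samtools view -c -f 0x42 -F 0x4 your.bam` for properly paired read 1s, where `your.bam` is a placeholder file name.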


Thanks swbarnes2 for your answer.

Could you kindly elaborate on this: "You can use samtools view to dissect how many read 1s and read 2s are properly paired versus just mapped versus unmapped"? Thanks.

I'm still learning how to use Samtools.
