Question: how do aligners determine which reads to apply duplicate reads flag (1024)?
0
gravatar for epigene
4.1 years ago by
epigene490
United States
epigene490 wrote:

I was trying to understand how aligners, specifically bwa, determine which reads to apply the 1024 flag in the sam output.

I notice I have cases where two identical reads (same read seq and same quality scores) are mapped to different locations on yeast chr16 as shown below:

Mxxx:1:2103:26709:7555     0       chr16   560671  0       90M     *       0       0       ATTAGTATGTAGAAATATAGATTCCATTTTGAGGATTCCTATATCCTCGAGGAGAACTTCTAGTATATTCTGTATACCTAATATTATAGC  CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG      MD:Z:90 PG:Z:MarkDuplicates     NM:i:0  AS:i:90     XS:i:90

Mxxx:1:2108:18700:2139     0       chr16   844608  0       90M     *       0       0       ATTAGTATGTAGAAATATAGATTCCATTTTGAGGATTCCTATATCCTCGAGGAGAACTTCTAGTATATTCTGTATACCTAATATTATAGC  CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG      MD:Z:90 PG:Z:MarkDuplicates     NM:i:0  AS:i:90     XS:i:90

Why are they not considered duplicates? And how are duplicates determined?

bwa sam • 1.2k views
ADD COMMENTlink modified 4.1 years ago by dariober11k • written 4.1 years ago by epigene490
4

What makes you think these two reads have the same read name?

One is Mxxx:1:2103:26709:7555, and one is Mxxx:1:2108:18700:2139.

Duplicate calling is usually not done by the mapper like BWA, but rather done later by a duplicate-marking tool, in your case it looks like Picard's MarkDuplicates due to PG:Z:MarkDuplicates. This tool uses not the DNA sequence but the read's mapped position to decide if it's duplicate or not, and these reads despite having the same DNA have different mapped positions.

This is likely because the reads have a MAPQ of 0, which for BWA specifically means that the read maps in multiple locations. In other words, this could very well be a PCR duplicate, however, since it also maps to multiple places on the genome and BWA randomly assigned those reads to different places, they weren't called duplicates by Picard.

I agree this is a really dumb problem to be having but this is how it goes.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by John12k

I didn't say they have the same read name..and they don't. I meant for the seq and quality.

Yeah, after BWA, the bam file was feed to Picard's MarkDuplicates tool. Thanks for the explanations.

ADD REPLYlink written 4.1 years ago by epigene490
1

Right, sorry, i'm just trying to help clarify some terminology as often different uses of certain words leads to a lot of confusion down the line. For example, you didn't say "identical read names" you said "identical reads", however a "read" is not defined by the sequenced DNA (which was implied), otherwise PCR duplicates or legitimate biological duplicates would all have the same read name. It is defined by where on the flowcell the read came from, (or more generally, an individual "sequencing event" in the sequencer). Thus, when you said "two identical reads" I was expecting to see two entries from the same sequencing event -- which happens when a mapper like BWA also outputs secondary alignments.

Maybe you already knew all that, but when you spend a lot of time on a support forum trying to help people, it becomes a habit to spell everything out just to make sure everyone is on the same page. I really didn't intend to frustrate you or anything like that :)

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by John12k

I understand.. I should have been more careful with the right terms as there are important distinctions.

ADD REPLYlink written 4.1 years ago by epigene490
0
gravatar for dariober
4.1 years ago by
dariober11k
WCIP | Glasgow | UK
dariober11k wrote:

So your question is effectively about picard MarkDuplicates and how it works. Have seen this FAQ page http://broadinstitute.github.io/picard/faq.html?

ADD COMMENTlink written 4.1 years ago by dariober11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1267 users visited in the last hour