Question: Mark duplicates among multimappers
3.6 years ago by
mathieu.bahin40 wrote:

Hi all,

I have a sequencing dataset for which I want to remove the duplicates (PCR and optical). I aligned it with TopHat2 and STAR allowing a maximum of 5 locations for the multimappers.

I used MarkDuplicates to mark the duplicates but I happened to find that only primary alignments were marked (and I read that this is the expected behaviour).

Is there a reason why we can't mark secondary alignments as duplicates? Is there an option to do that with MarkDuplicates that I missed? Otherwise, which tool can do that?




sequencing duplicates • 1.5k views
modified 3.5 years ago • written 3.6 years ago by mathieu.bahin40

First of all, why do you want to remove duplicate reads from RNA-seq data?

written 3.6 years ago by geek_y9.1k

This is MeRIP data; I want to find information about methylated genes (and other features, if any). After the mapping, I'll do peak calling. I think that duplicated reads will add noise to the results, no?

written 3.6 years ago by mathieu.bahin40

Removing multimapped reads should be okay but I am skeptical about removing duplicates.

written 3.6 years ago by geek_y9.1k

Indeed I'd prefer to use only the uniquely mapped reads but then I'd keep only 3% of my dataset...

Thanks for the link on how both of the tools work.

From what I understand, secondary alignments are not even considered by MarkDuplicates. So my guess is that if several secondary alignment reads are mapped to chr1,10000,10100, they won't be marked as duplicate although they could be. So I don't see how I can work with MarkDuplicates as long as I decide to keep multimappers?

written 3.5 years ago by mathieu.bahin40

I don't think programs that mark duplicate reads consider whether they are primary or secondary alignments. But I see your point: if the primary alignment of a read is a duplicate, then its secondary alignments may also be considered duplicates. A quick alternative I can think of: mark the duplicate reads, collect the read names of the duplicates, and then use that list to remove all reads with the same name. This also removes all other alignments of a duplicate read.
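The name-based workaround described above could be sketched roughly as follows. This is a minimal illustration, not a tested pipeline: it assumes the alignments are available as SAM text lines (for example from `samtools view -h`) and that duplicates were already flagged with the standard SAM duplicate bit (0x400). The function names `duplicate_read_names` and `drop_reads_by_name` are hypothetical. Note that for paired-end data both mates share a read name, so both would be dropped.

```python
DUPLICATE = 0x400  # SAM FLAG bit: PCR or optical duplicate


def duplicate_read_names(sam_lines):
    """Collect the names (QNAME) of all reads with at least one
    alignment flagged as duplicate."""
    names = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & DUPLICATE:
            names.add(fields[0])
    return names


def drop_reads_by_name(sam_lines, names):
    """Keep header lines and every alignment whose read name is not
    in `names`; primary and secondary alignments are treated alike."""
    kept = []
    for line in sam_lines:
        if not line.strip():
            continue
        if line.startswith("@") or line.split("\t")[0] not in names:
            kept.append(line)
    return kept


def remove_duplicates_with_multimappers(sam_lines):
    """Two passes: find duplicate read names, then drop all their alignments."""
    lines = list(sam_lines)
    return drop_reads_by_name(lines, duplicate_read_names(lines))
```

The unflagged representative of each duplicate set has a different read name, so it and its secondary alignments are kept.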

written 3.5 years ago by geek_y9.1k
3.6 years ago by
geek_y9.1k wrote:

Do not confuse multimapped reads with duplicate reads. The primary alignment of a read may be at chr1,100,200, while the same read could also have been mapped to a different position such as chr1,10000,10100. If there are multiple reads mapped at chr1,100,200, those will be marked as duplicates, but if there are no other reads at chr1,10000,10100, the read would not be treated as a duplicate at chr1,10000,10100, right?
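To make the distinction concrete, here is a toy sketch that groups alignments by locus. It is a deliberate simplification and not how Picard or samtools are implemented: real tools look at details such as strand, clipping and mate position, whereas this example assumes a duplicate is simply a second read with identical (chromosome, start, end). The function name `duplicates_per_locus` and the read names are made up for illustration.

```python
from collections import defaultdict


def duplicates_per_locus(alignments):
    """alignments: iterable of (read_name, chrom, start, end) tuples.

    Returns a dict mapping each locus to the read names that would be
    called duplicates there, treating the first read seen at a locus
    as the representative (toy rule, an assumption of this sketch).
    """
    by_locus = defaultdict(list)
    for name, chrom, start, end in alignments:
        by_locus[(chrom, start, end)].append(name)
    return {
        locus: names[1:]
        for locus, names in by_locus.items()
        if len(names) > 1
    }


alignments = [
    ("readA", "chr1", 100, 200),      # one location of multimapper readA
    ("readB", "chr1", 100, 200),      # another read at the same locus
    ("readA", "chr1", 10000, 10100),  # second location of readA, alone there
]
print(duplicates_per_locus(alignments))
```

In this toy input, the locus chr1,100,200 holds two reads, while chr1,10000,10100 holds only one, so nothing is called a duplicate at the second locus even though readA maps to both.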

What I would recommend is to use only uniquely mapped reads without removing duplicate reads.

See here for how MarkDuplicates and rmdup work: "Picard MarkDuplicates and SamTools rmdup algorithm documentation"

modified 3.6 years ago • written 3.6 years ago by geek_y9.1k

Sorry I should have added my reply here.

I assumed that it doesn't take secondary alignments into account because none of them were flagged as duplicates (although I really have a lot of them). And I think I read it somewhere...
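One way to check this kind of observation is to tally the duplicate flag separately for primary, secondary and supplementary alignments. The sketch below assumes SAM text lines as input (for example from `samtools view`) and relies only on the standard SAM FLAG bits (0x100 secondary, 0x400 duplicate, 0x800 supplementary); the function name `count_duplicates_by_class` is hypothetical.

```python
from collections import Counter

SECONDARY = 0x100      # SAM FLAG bit: secondary alignment
DUPLICATE = 0x400      # SAM FLAG bit: PCR or optical duplicate
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment


def count_duplicates_by_class(sam_lines):
    """Count alignments per (alignment class, duplicate status)."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & SECONDARY:
            kind = "secondary"
        elif flag & SUPPLEMENTARY:
            kind = "supplementary"
        else:
            kind = "primary"
        status = "duplicate" if flag & DUPLICATE else "not_duplicate"
        counts[(kind, status)] += 1
    return counts
```

If the count for `("secondary", "duplicate")` is zero while there are many secondary alignments, that matches the behaviour described here.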

Thanks for the suggested solution; that's more or less what I was thinking of doing. It's strange that I can't do it in an easier way with widely used tools, though.

written 3.5 years ago by mathieu.bahin40