Mark duplicates among multimappers
8.7 years ago

Hi all,

I have a sequencing dataset for which I want to remove the duplicates (PCR and optical). I aligned it with TopHat2 and STAR allowing a maximum of 5 locations for the multimappers.

I used MarkDuplicates to mark the duplicates, but I found that only primary alignments were marked (and I read that this is the expected behaviour).

Is there a reason why secondary alignments can't be marked? Is there an option in MarkDuplicates for this that I missed? If not, which tool can do it?

Cheers,
Mathieu

sequencing duplicates • 4.1k views

First of all, why do you want to remove duplicate reads from RNA-seq data?


This is MeRIP data; I want to find information about methylated genes (and other features, if any). After mapping, I'll do peak calling. I think duplicated reads will add noise to the results, no?


Removing multimapped reads should be okay but I am skeptical about removing duplicates.


Indeed I'd prefer to use only the uniquely mapped reads but then I'd keep only 3% of my dataset...

Thanks for the link on how both of the tools work.

From what I understand, secondary alignments are not even considered by MarkDuplicates. So my guess is that if several secondary alignment reads are mapped to chr1,10000,10100, they won't be marked as duplicate although they could be. So I don't see how I can work with MarkDuplicates as long as I decide to keep multimappers?


I don't think programs that mark duplicate reads consider whether an alignment is primary or secondary. But I see your point: if the primary alignment of a read is a duplicate, then its secondary alignments might also be considered duplicates. A quick alternative I can think of: mark the duplicate reads, collect the read names of the duplicates, and then use that list to remove every read with the same name. This also removes all other alignments of a duplicate read.
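A minimal sketch of the name-based removal described above, assuming the alignments are available as SAM text lines (for example from `samtools view -h`) and that a previous step has already set the duplicate flag (0x400) on primary alignments. The function names `duplicate_read_names` and `drop_reads_by_name` are made up for illustration and are not part of any tool. Note that mates of a pair share a read name, so this drops whole pairs.

```python
DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def duplicate_read_names(sam_lines):
    """Collect names of reads with at least one alignment flagged as duplicate."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.split("\t")
        if int(fields[1]) & DUPLICATE:
            names.add(fields[0])
    return names


def drop_reads_by_name(sam_lines, names):
    """Keep header lines and every alignment whose read name is not in `names`."""
    return [
        line for line in sam_lines
        if line.startswith("@") or line.split("\t")[0] not in names
    ]


# Toy SAM records; only QNAME and FLAG matter for this example.
sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "readA\t0\tchr1\t100\t3\t50M\t*\t0\t0\t*\t*",
    "readB\t1024\tchr1\t100\t3\t50M\t*\t0\t0\t*\t*",   # primary, marked duplicate
    "readB\t256\tchr1\t10000\t3\t50M\t*\t0\t0\t*\t*",  # secondary, not marked
    "readC\t0\tchr2\t500\t3\t50M\t*\t0\t0\t*\t*",
]

dups = duplicate_read_names(sam)
kept = drop_reads_by_name(sam, dups)
```

Here the unmarked secondary alignment of `readB` is removed together with its marked primary alignment, which is the behaviour asked for in the question.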

8.7 years ago

Do not confuse multimapped reads with duplicate reads. The primary alignment of a read may be at chr1,100,200 while the same read could also have been mapped to a different position such as chr1,10000,10100. If multiple reads map at chr1,100,200, they will be marked as duplicates there, but if no other reads map at chr1,10000,10100, the read would not be treated as a duplicate at chr1,10000,10100, right?
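To make the distinction concrete, here is a rough sketch that treats duplication as a property of each alignment position rather than of the read, and applies it to secondary alignments too. It is a deliberate simplification and not Picard's algorithm: it keys only on chromosome, leftmost position and strand for single-end records, and it keeps the first alignment seen instead of choosing one by base quality. The tuple layout and the function name `mark_duplicates_by_position` are assumptions for illustration.

```python
REVERSE = 0x10     # SAM flag bit: read aligned to the reverse strand
DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def mark_duplicates_by_position(records):
    """Set the duplicate bit on every alignment after the first one seen
    at the same (chrom, pos, strand), whether primary or secondary.

    records: list of (name, flag, chrom, pos) tuples, one per alignment.
    """
    seen = set()
    marked = []
    for name, flag, chrom, pos in records:
        key = (chrom, pos, bool(flag & REVERSE))
        if key in seen:
            flag |= DUPLICATE
        else:
            seen.add(key)
        marked.append((name, flag, chrom, pos))
    return marked


records = [
    ("r1", 0, "chr1", 100),      # primary alignment of r1
    ("r2", 0, "chr1", 100),      # same position: duplicate of r1 here
    ("r1", 256, "chr1", 10000),  # secondary alignment of r1, first at this position
    ("r3", 256, "chr1", 10000),  # secondary alignment of r3, same position
    ("r3", 0, "chr2", 500),      # primary alignment of r3
]

marked = mark_duplicates_by_position(records)
flags = [flag for _, flag, _, _ in marked]
```

In this toy input, `r1` is not a duplicate at either of its positions, while `r3` becomes a duplicate only at its secondary position. Whether that outcome is desirable for peak calling is a separate question.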

What I would recommend is to use only uniquely mapped reads without removing duplicate reads.
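One possible way to keep only uniquely mapped reads is to filter on the optional `NH` tag (number of reported alignments for the read), which STAR and TopHat2 normally write. This sketch assumes SAM text lines that carry that tag; the helper name `is_uniquely_mapped` is made up for illustration.

```python
def is_uniquely_mapped(sam_line):
    """True if the alignment carries the optional tag NH:i:1 (one reported hit)."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column of a SAM record.
    return "NH:i:1" in fields[11:]


def keep_unique(sam_lines):
    """Keep header lines and alignments of uniquely mapped reads."""
    return [
        line for line in sam_lines
        if line.startswith("@") or is_uniquely_mapped(line)
    ]


unique = "readA\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1\tHI:i:1"
multi = "readB\t0\tchr1\t100\t3\t50M\t*\t0\t0\t*\t*\tNH:i:5\tHI:i:1"
header = "@HD\tVN:1.6\tSO:coordinate"

filtered = keep_unique([header, unique, multi])
```

Records without an `NH` tag are dropped by this filter, so it is worth checking that the aligner output actually contains the tag before relying on it.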

See here for how MarkDuplicates and rmdup work: Picard MarkDuplicates and SamTools rmdup algorithm documentation


Sorry I should have added my reply here.

I assumed that it doesn't take secondary alignments into account because none of them were flagged as duplicates (although I really have a lot of them). And I think I read it somewhere...

Thanks for the suggested solution, that's more or less what I was thinking of doing. It's strange that I can't do it in an easier way with widely used tools, though.
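A quick way to check this observation is to tally the duplicate flag separately for primary and secondary alignments. The sketch below assumes SAM text lines as input (for example from `samtools view`); `tally_duplicate_flags` is a hypothetical helper, and supplementary alignments (0x800) are not separated out, so they are counted with the primary ones.

```python
from collections import Counter

SECONDARY = 0x100  # SAM flag bit: secondary alignment
DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def tally_duplicate_flags(sam_lines):
    """Count alignments by (primary/secondary, duplicate/not_duplicate)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        flag = int(line.split("\t")[1])
        kind = "secondary" if flag & SECONDARY else "primary"
        status = "duplicate" if flag & DUPLICATE else "not_duplicate"
        counts[(kind, status)] += 1
    return counts


example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "readA\t0\tchr1\t100\t3\t50M\t*\t0\t0\t*\t*",
    "readB\t1024\tchr1\t100\t3\t50M\t*\t0\t0\t*\t*",
    "readA\t256\tchr1\t10000\t3\t50M\t*\t0\t0\t*\t*",
    "readB\t256\tchr1\t10000\t3\t50M\t*\t0\t0\t*\t*",
]

counts = tally_duplicate_flags(example)
```

If the count for `("secondary", "duplicate")` is zero on a real file with many secondary alignments, that is consistent with the tool only marking primary alignments.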
