Hi all,
I have a sequencing dataset for which I want to remove the duplicates (PCR and optical). I aligned it with TopHat2 and STAR allowing a maximum of 5 locations for the multimappers.
I used MarkDuplicates to mark the duplicates but I happened to find that only primary alignments were marked (and I read that this is the expected behaviour).
Is there a reason why we can't mark secondary alignments? Is there an option to do that with MarkDuplicates that I might have missed? Otherwise, which tool can do that?
Cheers,
Mathieu
First of all, why do you want to remove duplicate reads from RNA-seq data?
This is MeRIP data; I want to find out which genes (and other features, if any) are methylated. After the mapping, I'll do peak calling. I think duplicated reads will add noise to the results, no?
Removing multimapped reads should be okay but I am skeptical about removing duplicates.
Indeed I'd prefer to use only the uniquely mapped reads but then I'd keep only 3% of my dataset...
Thanks for the link on how both of the tools work.
From what I understand, secondary alignments are not even considered by MarkDuplicates. So my guess is that if several secondary alignments are mapped to chr1,10000,10100, they won't be marked as duplicates although they could be. So I don't see how I can work with MarkDuplicates as long as I decide to keep multimappers?
I don't think programs that mark duplicate reads consider whether an alignment is primary or secondary. But I see your point: if the primary alignment of a read is a duplicate, then the secondary alignments of that read may also be considered duplicates. A quick alternative I can think of: mark the duplicate reads, get the read names of the duplicates, and then use that list to remove all alignments with the same names. This way, all other alignments of a duplicate read are removed as well.
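To make the read-name idea concrete, here is a minimal sketch in plain Python that works on SAM text lines, for example the output of MarkDuplicates converted to SAM. It relies only on the SAM conventions that the read name is column 1, the FLAG is column 2, and bit 0x400 means "PCR or optical duplicate". The function name and the toy records are made up for illustration. Note that it only removes the other alignments of reads already flagged as duplicates; secondary alignments of unflagged reads that happen to stack at the same position are not detected.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def drop_all_alignments_of_duplicates(sam_lines):
    """Return SAM lines with every alignment of a duplicate-flagged read removed.

    Hypothetical helper for illustration, not part of Picard or samtools.
    """
    lines = [line.rstrip("\n") for line in sam_lines if line.strip()]

    # Pass 1: collect the names (column 1) of reads that have at least one
    # alignment carrying the duplicate flag (column 2, bit 0x400).
    duplicate_names = set()
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.split("\t")
        if int(fields[1]) & DUPLICATE_FLAG:
            duplicate_names.add(fields[0])

    # Pass 2: keep headers and every record whose read name is not in the set,
    # so the secondary alignments of a duplicate read are dropped too.
    return [
        line
        for line in lines
        if line.startswith("@") or line.split("\t")[0] not in duplicate_names
    ]


# Toy records (made up): readA's primary alignment is flagged as a duplicate
# (1024); its secondary alignment (256) is not flagged but shares the name.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "readA\t1024\tchr1\t10000\t3\t100M\t*\t0\t0\t*\t*",
    "readA\t256\tchr2\t5000\t3\t100M\t*\t0\t0\t*\t*",
    "readB\t0\tchr1\t10000\t3\t100M\t*\t0\t0\t*\t*",
    "readB\t256\tchr3\t7000\t3\t100M\t*\t0\t0\t*\t*",
]
filtered = drop_all_alignments_of_duplicates(example)
```

With paired-end data both mates share a read name, so filtering by name drops the whole pair. The read that MarkDuplicates kept as the representative has a different name and is not flagged, so it and its secondary alignments stay in the output.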