Question

Samtools Rmdup And Picard Mark Duplicates

2

11.3 years ago

Kssr ▴ 110

I ran FastQC and found a 33% duplication level in my sample. It is single-end data. The average coverage is 10x. So, I used samtools rmdup and Picard MarkDuplicates, and my duplication level dropped to 1%. I have a few questions regarding removing duplicates:

1. Do both samtools and Picard remove duplicates based on position alone? How is Picard MarkDuplicates different from rmdup? (They give very similar results, though.) Just curious to know which one is better.

2. I am not sure whether it is advisable to remove duplicates from single-end data, or how the above programs treat such data.

3. When I run samtools rmdup, it prints

[bam_rmdupse_core] 3566092 / 20492754 = 0.1740

My final dedup .bam has 20979669 reads. I don't understand what value is used for the denominator in the above case, i.e. the value 20492754. Any comments/suggestions appreciated.

samtools picard duplicates • 11k views

updated 11.3 years ago by Istvan Albert 100k • written 11.3 years ago by Kssr ▴ 110

1

Also see this thread on SEQanswers.

11.3 years ago by Irsan ★ 7.8k

score 2 · Answer 1 · 2013-01-17

Short answer: duplicate identification may use sequence identity or mapping locations. For the latter, a read needs to be mapped to a location, so unmapped reads are not processed. This explains what you see. The optimal solution depends on many factors - the consensus seems to be that Picard MarkDuplicates could be the best current solution.
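To make the position-based idea concrete, here is a minimal Python sketch for single-end reads. It is not the actual samtools or Picard code: the `Read` record, the function name, and the "keep the highest mapping quality" tie-break are all assumptions made for illustration, and real tools work from the 5' end of the alignment rather than a plain `pos` field.

```python
from collections import namedtuple

# Hypothetical minimal read record; chrom is None for an unmapped read.
Read = namedtuple("Read", "name chrom pos strand mapq")

def dedup_single_end(reads):
    """Keep one read per (chrom, pos, strand); pass unmapped reads through."""
    best = {}        # key -> best mapped read seen so far
    unmapped = []    # never compared, so never counted as duplicates
    examined = 0     # counts mapped reads only
    for r in reads:
        if r.chrom is None:
            unmapped.append(r)
            continue
        examined += 1
        key = (r.chrom, r.pos, r.strand)
        # Assumed tie-break for this sketch: keep the higher mapping quality.
        if key not in best or r.mapq > best[key].mapq:
            best[key] = r
    removed = examined - len(best)
    return list(best.values()) + unmapped, removed, examined

reads = [
    Read("r1", "chr1", 100, "+", 30),
    Read("r2", "chr1", 100, "+", 60),   # same position and strand as r1
    Read("r3", "chr1", 100, "-", 60),   # same position, other strand: kept
    Read("r4", None, None, None, 0),    # unmapped: not examined
]
kept, removed, examined = dedup_single_end(reads)
print(f"{removed} / {examined} = {removed / examined:.4f}")  # 1 / 3 = 0.3333
print(len(kept))  # 3
```

In this toy model the denominator counts only the mapped reads that were compared, so it can be smaller than the number of reads in the file. That is one possible reading of a log line like the one in the question, not a statement about how rmdup computes it.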

The appropriateness of duplicate removal depends on coverage - one would want to remove only the artificial duplicates (e.g. PCR duplicates) and keep the natural duplicates, i.e. independent fragments that happen to map to the same position.
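As a rough illustration of why coverage matters, the sketch below estimates how many natural duplicates to expect if single-end reads started uniformly at random on both strands. The uniform model, the function name, and the 100 bp read length are assumptions (the question does not give a read length), and real libraries are far from uniform, so treat the numbers as an order-of-magnitude guide only.

```python
import math

def expected_duplicate_fraction(coverage, read_length):
    """Expected fraction of reads that share a start site and strand with
    another read, under uniform random sampling of start sites."""
    # Mean number of reads per (position, strand) start site.
    lam = coverage / (2.0 * read_length)
    # Occupancy argument: the fraction of sites hit at least once is
    # 1 - exp(-lam), so distinct reads / total reads = (1 - exp(-lam)) / lam.
    return 1.0 - (1.0 - math.exp(-lam)) / lam

for cov in (10, 100, 1000):
    print(cov, round(expected_duplicate_fraction(cov, 100), 3))
# 10 0.025
# 100 0.213
# 1000 0.801
```

Under these assumptions only a few percent of reads at 10x would be natural duplicates, while at very high coverage most position-level duplicates would be natural, and removing them would discard real signal.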

There are many similar questions on Biostar - try the search above.