Question: Removing Duplicate Reads (>= 5)
5.6 years ago by
vj400 wrote:


I am having trouble finding software that can remove duplicate reads from aligned files (single-end SAM/BAM or BED). If more than "n" reads align to the same position, I want to keep only the top "n" of them, ranked by mapping quality. I tried Picard, but it marks reads as duplicates whenever two or more align to the same position, and it does not seem to have an option for setting "n". Is there other software I can use to accomplish this? I would like to use n=5.

Thanks a lot!

duplicates picard samtools • 2.2k views
modified 5.6 years ago by lomereiter440 • written 5.6 years ago by vj400
5.6 years ago by
Russian Federation
lomereiter440 wrote:

I encourage you to grab the Picard source code and modify it to suit your needs. After all, that's exactly why we bioinformaticians prefer open source.
It only requires patching a couple of simple functions, markDuplicatePairs and markDuplicateFragments, where you need to do a partial sort by score instead of taking a single fragment/pair.
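
To illustrate what a partial sort by score means here, the following is a minimal Python sketch, not Picard code. The function name `top_n_by_score`, the `(name, score)` tuples, and the example scores are all hypothetical; the real patch would have to be written in Java against Picard's own data structures.

```python
import heapq


def top_n_by_score(reads, n=5):
    """Return the n highest-scoring reads from one set of duplicates.

    `reads` is an iterable of (name, score) tuples (a made-up layout for
    illustration). heapq.nlargest performs a partial sort: it tracks only
    the n best candidates instead of sorting the whole set.
    """
    return heapq.nlargest(n, reads, key=lambda read: read[1])


# Hypothetical duplicate set: seven reads at one position, distinct scores.
duplicate_set = [
    ("r1", 30), ("r2", 60), ("r3", 12), ("r4", 55),
    ("r5", 42), ("r6", 0), ("r7", 37),
]
kept = top_n_by_score(duplicate_set, n=5)
```

With n=1 this reduces to taking the single best fragment, which is the behaviour the suggestion proposes to generalize.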

written 5.6 years ago by lomereiter440

Thanks for the suggestion. As I do not have any Java skills, I took the approach Istvan Albert suggested below. I scanned through the SAM file and wrote out the names of the reads that are aligned more than N times to the same position. Then I used FilterSamReads (Picard tools) to remove those reads. I know it is not elegant, but it does the job, and pretty quickly as well.
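
A rough sketch of what such a scan might look like follows; it is not the script that was actually used. It assumes single-end reads, treats "same position" as the same reference name, leftmost POS, and strand, and ranks reads by MAPQ. The function name `surplus_read_names` is hypothetical.

```python
from collections import defaultdict


def surplus_read_names(sam_lines, n=5):
    """Return names of reads ranked below the top n (by MAPQ) at their position.

    Assumptions (not from the original post): single-end reads, and
    "position" means (RNAME, POS, strand) taken from SAM columns 3, 4 and 2.
    """
    by_position = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        rname, pos, mapq = fields[2], int(fields[3]), int(fields[4])
        if flag & 0x4:  # unmapped reads have no position to compare
            continue
        strand = "-" if flag & 0x10 else "+"
        by_position[(rname, pos, strand)].append((mapq, name))

    surplus = []
    for reads in by_position.values():
        # Highest MAPQ first; everything after the first n is surplus.
        reads.sort(key=lambda read: read[0], reverse=True)
        surplus.extend(name for _, name in reads[n:])
    return surplus
```

Writing the returned names to a text file, one per line, gives a list that Picard's FilterSamReads can take through READ_LIST_FILE with FILTER=excludeReadList. Because that filter works on read names, a read with several alignment records would lose all of them, which is worth checking before relying on this.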

written 5.6 years ago by vj400
5.6 years ago by
University Park, USA
Istvan Albert ♦♦ 80k wrote:

There is no such tool that I know of. One reason could be that it would make data analysis even more subjective.

Keeping or removing duplicates has an established rationale, and even that is often disputed. Keeping the first N reads that map to a location is even more subjective. In fact, I assume you tacitly mean that you want to keep the N best-aligned reads at a given location and not just any N. And then, of course, what happens with paired-end reads where only one read of the pair maps N times, or where the mapping qualities of the two reads differ? Would you break the pair, remove both, or remove neither?

Your best bet is to come up with your own algorithm and write a simple script that reads a SAM file line by line and applies a simple condition, implementing your decisions, to decide whether or not to print each line. You could use this to create a different SAM file from the one you have.
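
As one possible shape for such a script, here is a minimal sketch under stated assumptions: the input is a coordinate-sorted, single-end SAM file, the "decision" is to keep the n reads with the highest MAPQ at each (reference, POS), and strand is ignored to keep the streaming logic simple. The names `keep_top_n`, `position_key`, and `mapq` are hypothetical, and the paired-end questions raised above are deliberately left unanswered.

```python
import heapq
import itertools
import sys


def position_key(line):
    """Group key: reference name and leftmost position (SAM columns 3 and 4)."""
    fields = line.split("\t")
    return fields[2], int(fields[3])


def mapq(line):
    """Mapping quality (SAM column 5)."""
    return int(line.split("\t")[4])


def keep_top_n(sam_lines, n=5):
    """Yield header lines unchanged, then at most n reads per position.

    Assumes coordinate-sorted input, so reads sharing a position are adjacent.
    """
    lines = iter(sam_lines)
    body = []
    for line in lines:
        if line.startswith("@"):
            yield line  # pass the header through untouched
        else:
            # First alignment line: put it back in front of the rest.
            body = itertools.chain([line], lines)
            break
    for (rname, _pos), group in itertools.groupby(body, key=position_key):
        if rname == "*":
            yield from group  # unmapped reads are not filtered
        else:
            yield from heapq.nlargest(n, group, key=mapq)


if __name__ == "__main__":
    # Example: python keep_top_n.py < sorted.sam > filtered.sam
    sys.stdout.writelines(keep_top_n(sys.stdin, n=5))
```

Within a position the kept reads come out ordered by MAPQ, which leaves the file coordinate-sorted. Any other rule, such as accounting for strand or clipping, would go into `position_key`.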

modified 5.6 years ago • written 5.6 years ago by Istvan Albert ♦♦ 80k