Removing Duplicate Reads (>= 5)
8.1 years ago
vj ▴ 480

Hello,

I am having trouble finding software that can remove duplicate reads from aligned files (single-end SAM/BAM or BED files). I intend to keep the top "n" aligned reads (ranked by mapping quality) when more than "n" reads align to the same position. I tried Picard, but it marks reads as duplicates whenever two or more reads align to the same position, and it does not seem to have an option to set "n". Is there any other software I can use to accomplish this? I would like to use n=5.

Thanks a lot!

duplicates picard samtools • 2.8k views
8.1 years ago
lomereiter ▴ 470

I encourage you to grab the Picard source code and modify it to suit your needs. After all, that's exactly why we bioinformaticians prefer open source.
It only requires patching a couple of simple functions, markDuplicatePairs and markDuplicateFragments, where you need to do a partial sort by score instead of taking a single fragment/pair:
http://sourceforge.net/p/picard/code/1615/tree/trunk/src/java/net/sf/picard/sam/MarkDuplicates.java#l656
http://sourceforge.net/p/picard/code/1615/tree/trunk/src/java/net/sf/picard/sam/MarkDuplicates.java#l697
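To illustrate what "partial sort by score" means here, a minimal Python sketch (not Picard's Java code) could look like the following. The `(name, score)` tuples and the function name `top_n_by_score` are made up for the example; in Picard the score is computed from the read itself.

```python
import heapq

def top_n_by_score(fragments, n):
    """Return the n fragments with the highest score.

    `fragments` is a list of (name, score) tuples; this layout is an
    assumption for the example, not Picard's data structure.
    """
    # heapq.nlargest performs a partial sort: it never orders the whole list.
    return heapq.nlargest(n, fragments, key=lambda frag: frag[1])

fragments = [("r1", 30), ("r2", 60), ("r3", 10), ("r4", 60), ("r5", 42)]
keep = top_n_by_score(fragments, 3)
keep_names = {name for name, _score in keep}
# Everything not in keep_names would be marked as a duplicate.
duplicates = [name for name, _score in fragments if name not in keep_names]
```

The patched functions would then mark everything outside the top n as duplicate, rather than everything except the single best fragment/pair.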


Thanks for the suggestion. As I do not have any Java skills, I took the option Istvan Albert suggested below. I scanned through the SAM file and wrote out a list of the names of the reads that are aligned more than N times to the same position. Then I used FilterSamReads (Picard tools) to remove those reads. I know it is not elegant, but it does the job, and pretty quickly as well.
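For anyone wanting to try the same route, here is a rough sketch of the scanning step in Python. It is not the script I actually ran, and it makes some assumptions: reads are single-end, "same position" means the same reference name, POS field and strand (Picard itself uses the unclipped 5' end, which this ignores), and the function name `excess_read_names` is invented for the example. It also holds all positions in memory, so very large files may need a streaming approach instead.

```python
from collections import defaultdict

def excess_read_names(sam_lines, n):
    """Return names of reads ranked below the top n (by MAPQ) at each position."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:                    # skip unmapped reads
            continue
        strand = "-" if flag & 0x10 else "+"
        key = (fields[2], int(fields[3]), strand)   # RNAME, POS, strand
        groups[key].append((int(fields[4]), fields[0]))  # (MAPQ, QNAME)

    excess = []
    for reads in groups.values():
        # Stable sort, highest MAPQ first; ties keep file order.
        reads.sort(key=lambda read: -read[0])
        excess.extend(name for _mapq, name in reads[n:])
    return excess
```

The returned names can be written one per line to a text file and passed to FilterSamReads as the read list to exclude; check the Picard documentation for the exact option names in your version.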

8.1 years ago

There is no such tool that I know of. One reason could be that it would make data analysis even more subjective.

Keeping or removing duplicates has an established rationale, and even that is often disputed. Keeping the first N reads that map to a location is even more subjective. In fact, I assume you tacitly imply that you want to keep the N best-aligned reads at a given location, not just any N. And then, of course, what happens with paired-end reads where only one read of the pair maps N times, or when the mapping qualities within a pair differ? Would you break the pair, remove both, remove none, etc.?

Your best bet is to come up with your own algorithm and write a simple script that reads a SAM file line by line, has a simple condition that implements your decisions and decides if it prints the line or not. You could use this to create a different SAM file out of what you have.
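As a starting point, such a script might look like the sketch below. It encodes one possible set of decisions, which you should replace with your own: the input is assumed to be a coordinate-sorted, single-end SAM file; "same location" means the same reference name and POS field; forward and reverse strand alignments are ranked separately by MAPQ; and unmapped reads are passed through untouched. The function names are invented for the example.

```python
def _flush(buffer, n):
    """Yield the lines to keep from one same-position group, in input order."""
    ranked = {}
    for idx, (flag, mapq, _line) in enumerate(buffer):
        # Rank forward and reverse strand alignments separately.
        ranked.setdefault(flag & 0x10, []).append((-mapq, idx))
    keep = set()
    for entries in ranked.values():
        # Highest MAPQ first; ties go to the read seen earlier in the file.
        keep.update(idx for _neg_mapq, idx in sorted(entries)[:n])
    for idx, (_flag, _mapq, line) in enumerate(buffer):
        if idx in keep:
            yield line

def filter_sorted_sam(lines, n):
    """Yield SAM lines, keeping at most n alignments per position and strand."""
    current, buffer = None, []
    for line in lines:
        if line.startswith("@"):          # header lines pass straight through
            yield line
            continue
        fields = line.split("\t")
        flag = int(fields[1])
        if flag & 0x4:                    # unmapped read: emit unchanged
            yield from _flush(buffer, n)
            current, buffer = None, []
            yield line
            continue
        key = (fields[2], fields[3])      # RNAME, POS
        if key != current:                # new position: emit previous group
            yield from _flush(buffer, n)
            current, buffer = key, []
        buffer.append((flag, int(fields[4]), line))
    yield from _flush(buffer, n)
```

With `import sys`, a call such as `sys.stdout.writelines(filter_sorted_sam(sys.stdin, 5))` would read a SAM file from standard input and write the filtered one to standard output. Because only one position is buffered at a time, memory use stays small, but the result is only correct if the input really is coordinate-sorted.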
