Samtools: removing PCR duplicates
9.8 years ago

Could anyone explain the difference between the options -s and -S for "samtools rmdup"? In addition, is it standard practice to use -sS to remove duplicate reads?

I recently tried to remove the duplicates from one BAM file. After running the command "samtools rmdup -sS in.nameSrt.bam out.bam", the size of the BAM file decreased from 11G to 5.2G, and the log showed that 52.48% of the reads had been removed. I'm really worried about this massive loss of data.

By the way, one of my goals in the downstream analysis is to call genotypes and detect SNPs.

9.8 years ago

I would recommend using Picard MarkDuplicates. See also http://seqanswers.com/forums/showthread.php?t=5424.

High duplication might be expected if you sequenced quite deep. As an extreme example, duplication in RNA-Seq is quite high, but that is expected. I would look at some regions in a genome browser (e.g. IGV) to get a feel for whether reads are spread fairly uniformly or tend to cluster in stacks of identical reads, which would suggest over-amplification.
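
To put a number on the duplication level before deciding whether it is worrying, one option is to count the records that carry the SAM duplicate flag (0x400) in a file where duplicates were marked rather than removed. The following is a minimal sketch, assuming the alignments are supplied as SAM text lines (for example, from "samtools view -h"); the function name `duplicate_fraction` is hypothetical and not part of any tool.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: read is a PCR or optical duplicate


def duplicate_fraction(sam_lines):
    """Return (duplicates, total, fraction) over SAM alignment lines.

    Header lines (starting with '@') and blank lines are skipped.
    Every alignment record is counted, including secondary and
    supplementary ones, so this is only a rough estimate.
    """
    total = 0
    duplicates = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is the bitwise FLAG
        total += 1
        if flag & DUPLICATE_FLAG:
            duplicates += 1
    fraction = duplicates / total if total else 0.0
    return duplicates, total, fraction


# Made-up records for illustration only (not real data).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",
    "read2\t1123\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",  # 1123 = 99 + 1024
]
dup, total, frac = duplicate_fraction(example)
print(f"{dup} of {total} records flagged as duplicates ({frac:.2%})")
```

On a real file the lines would be read from standard input or from a SAM file instead of the `example` list.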

The question is about single-end reads (-s). Does MarkDuplicates work with SE data?

Do you know the difference between the two options -s and -S? Could I use -s only to avoid too much data loss?

Are you really using single-end data?

No, the sequencing was done on an Illumina HiSeq 2000, which should generate paired-end data.
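
Whether a BAM file actually contains paired-end records can be checked directly instead of inferred from the instrument: in the SAM format, bit 0x1 of the FLAG field is set when the read belongs to a pair. A rough sketch, assuming SAM text lines as input (for example, from "samtools view"); `count_paired` and its `limit` parameter are hypothetical names chosen for illustration.

```python
PAIRED_FLAG = 0x1  # SAM FLAG bit: template has multiple segments (paired)


def count_paired(sam_lines, limit=100000):
    """Count paired and unpaired records among the first `limit` alignments."""
    paired = 0
    single = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # column 2 is the bitwise FLAG
        if flag & PAIRED_FLAG:
            paired += 1
        else:
            single += 1
        if paired + single >= limit:
            break  # a sample of records is enough to see the library type
    return paired, single


# Made-up records for illustration only (not real data).
example = [
    "read1\t99\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",  # paired
    "read2\t0\tchr1\t300\t60\t50M\t*\t0\t0\t*\t*",       # unpaired
]
paired, single = count_paired(example)
print(f"paired records: {paired}, unpaired records: {single}")
```

If essentially all records are paired, the single-end options should not be needed.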

So you don't need the -s option, and you should use MarkDuplicates. From the samtools manual (http://samtools.sourceforge.net/samtools.shtml): "Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan reads or ends mapped to different chromosomes). If this is a concern, please use Picard's MarkDuplicate which correctly handles these cases, although a little slower."

I would like to try MarkDuplicates if I could. However, I need to process thousands of BAM files, and my pipeline relies heavily on Samtools.

In principle I don't see any problem in passing a file to MarkDuplicates instead of samtools. Maybe you should give more detail of your pipeline.
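
As an illustration of how MarkDuplicates could slot into a batch pipeline, the sketch below only builds one command line per BAM file and prints it; it does not run anything. It assumes the legacy Picard argument style (I=, O=, M=) and a jar named picard.jar in the working directory. The function name, output naming scheme, and file names are hypothetical, and Picard's documentation should be consulted for the sort order it expects.

```python
import shlex
from pathlib import Path


def markduplicates_command(in_bam, out_dir, picard_jar="picard.jar", remove=False):
    """Build (but do not run) a Picard MarkDuplicates command for one BAM file."""
    in_bam = Path(in_bam)
    out_dir = Path(out_dir)
    out_bam = out_dir / f"{in_bam.stem}.dedup.bam"           # assumed naming scheme
    metrics = out_dir / f"{in_bam.stem}.dup_metrics.txt"     # duplication metrics file
    cmd = [
        "java", "-jar", str(picard_jar), "MarkDuplicates",
        f"I={in_bam}",
        f"O={out_bam}",
        f"M={metrics}",
    ]
    if remove:
        # By default duplicates are only flagged; this option drops them instead.
        cmd.append("REMOVE_DUPLICATES=true")
    return cmd


# Hypothetical input files; in a real pipeline these might come from
# something like Path("bams").glob("*.bam").
for bam in ["sampleA.bam", "sampleB.bam"]:
    print(shlex.join(markduplicates_command(bam, "dedup")))
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the paths are real. Leaving `remove=False` keeps every read in the output and only flags duplicates, which may be easier to accept than losing half the file.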
