Question: Samtools: removing PCR duplicates
devinliao0918 (United States) wrote 6.3 years ago:

Could anyone explain the difference between the options -s and -S for "samtools rmdup"? In addition, is it standard practice to use -sS to remove duplicate reads?

I recently tried to remove the duplicates in one BAM file. After running the command "samtools rmdup -sS in.nameSrt.bam out.bam", the size of the BAM file decreased from 11G to 5.2G, and the log showed that 52.48% of the reads had been removed. I'm really worried about such a massive loss of data.

By the way, one of my goals in the downstream analysis is to call genotypes and detect SNPs.

dariober (WCIP | Glasgow | UK) wrote 6.3 years ago:

I would recommend using Picard MarkDuplicates. See also http://seqanswers.com/forums/showthread.php?t=5424.

High duplication might be expected if you sequenced quite deep. As an extreme case, duplication in RNA-Seq is quite high, but that is expected. I would look at some regions in a genome browser (e.g. IGV) to get a feel for whether reads are spread fairly uniformly or tend to cluster in stacks of reads, which would suggest over-amplification.
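To get a feel for why depth alone can produce many coordinate duplicates, here is a rough sketch. It assumes read start sites are drawn uniformly at random from a fixed number of possible positions (with strand folded into that count), which real libraries only approximate; the function name and the example numbers are illustrative, not taken from any tool.

```python
import math


def expected_duplicate_fraction(n_reads: int, n_positions: int) -> float:
    """Expected fraction of reads that share a start site with an earlier read.

    Assumption: start sites are sampled uniformly and independently, and the
    count of distinct sites hit is taken from the Poisson approximation.
    """
    if n_reads <= 0 or n_positions <= 0:
        raise ValueError("n_reads and n_positions must be positive")
    # Expected number of distinct start sites covered by at least one read.
    expected_unique = n_positions * (1.0 - math.exp(-n_reads / n_positions))
    # Every read beyond the first at a site would be called a duplicate.
    return 1.0 - expected_unique / n_reads


if __name__ == "__main__":
    positions = 1_000_000  # hypothetical number of possible start sites
    for reads in (100_000, 1_000_000, 5_000_000):
        frac = expected_duplicate_fraction(reads, positions)
        print(f"{reads:>9} reads: {frac:.1%} expected duplicates")
```

Under these assumptions the duplicate fraction climbs steadily as the number of reads approaches and exceeds the number of available start sites, even with no PCR over-amplification at all. Clustered stacks well beyond this baseline are what would point to over-amplification.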


The question is about single-end mode (-s). Does MarkDuplicates work with SE reads?

written 6.3 years ago by Pierre Lindenbaum

Do you know the difference between the two options -s and -S? Could I use -s only to avoid too much data loss?

written 6.3 years ago by devinliao0918

Are you really using single-end data?

written 6.3 years ago by Pierre Lindenbaum

No, the sequencing was done on an Illumina HiSeq 2000, which should generate paired-end data.

written 6.3 years ago by devinliao0918

So you don't need the -s option, and you should use MarkDuplicates. From the samtools documentation (http://samtools.sourceforge.net/samtools.shtml): "Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan reads or ends mapped to different chromosomes). If this is a concern, please use Picard's MarkDuplicate which correctly handles these cases, although a little slower."
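For what it's worth, here is a minimal sketch of why treating paired-end reads as single-end tends to remove more reads. Per the samtools manual, -s removes duplicates for single-end reads and -S treats paired-end reads as single-end. This is not samtools' implementation: the Pair record, the dedup function and the toy coordinates are all made up, and the single-end mode is simplified to key on read 1 only.

```python
from collections import namedtuple

# Hypothetical, simplified record for one mapped read pair.
Pair = namedtuple("Pair", "name chrom start1 strand1 start2 strand2 mapq")


def dedup(pairs, treat_as_single_end):
    """Keep one pair per duplicate key, preferring the highest mapping quality."""
    best = {}
    for p in pairs:
        if treat_as_single_end:
            # Single-end style key: only the first read's position and strand.
            key = (p.chrom, p.start1, p.strand1)
        else:
            # Paired-end style key: the outer coordinates of both ends.
            key = (p.chrom, p.start1, p.strand1, p.start2, p.strand2)
        if key not in best or p.mapq > best[key].mapq:
            best[key] = p
    return sorted(best.values(), key=lambda p: p.name)


pairs = [
    Pair("a", "chr1", 100, "+", 350, "-", 60),
    Pair("b", "chr1", 100, "+", 350, "-", 40),  # same outer coordinates as "a"
    Pair("c", "chr1", 100, "+", 420, "-", 60),  # same read-1 start, different mate
    Pair("d", "chr1", 250, "+", 500, "-", 60),
]

kept_pe = [p.name for p in dedup(pairs, treat_as_single_end=False)]
kept_se = [p.name for p in dedup(pairs, treat_as_single_end=True)]
print("paired-end key keeps:", kept_pe)
print("single-end key keeps:", kept_se)
```

In this toy data the paired-end key drops only "b", while the single-end key also drops "c", whose mate maps elsewhere and so is probably a distinct fragment. That kind of over-collapsing is one plausible contributor to a removal rate as high as the one reported in the question.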

written 6.3 years ago by Pierre Lindenbaum

I would like to try MarkDuplicates if I could. However, I need to process thousands of BAM files, and my pipeline relies heavily on Samtools.

written 6.3 years ago by devinliao0918

In principle I don't see any problem in passing a file to MarkDuplicates instead of samtools. Maybe you could give more detail about your pipeline.
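As a sketch of how this swap could look for many files, the snippet below builds one Picard MarkDuplicates command per BAM and, by default, only prints them. The jar location, the output naming scheme and the helper names are assumptions for illustration; it uses Picard's classic KEY=value arguments (I=, O=, M=) and expects sorted input as Picard requires. Note that MarkDuplicates by default flags duplicates rather than deleting them, so no reads are lost from the file.

```python
import subprocess
from pathlib import Path

PICARD_JAR = "picard.jar"  # hypothetical path; point this at your Picard jar


def markduplicates_command(bam, out_dir, picard_jar=PICARD_JAR):
    """Build the argument list for one Picard MarkDuplicates run."""
    bam = Path(bam)
    out_dir = Path(out_dir)
    out_bam = out_dir / f"{bam.stem}.dedup.bam"          # assumed naming scheme
    metrics = out_dir / f"{bam.stem}.dup_metrics.txt"    # assumed naming scheme
    return [
        "java", "-jar", str(picard_jar), "MarkDuplicates",
        f"I={bam}",       # input BAM
        f"O={out_bam}",   # output BAM with duplicates flagged
        f"M={metrics}",   # duplication metrics report
    ]


def run_all(bams, out_dir, dry_run=True):
    """Print (dry run) or execute one MarkDuplicates command per BAM file."""
    commands = [markduplicates_command(b, out_dir) for b in bams]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if Picard exits non-zero
    return commands


if __name__ == "__main__":
    run_all(["sample1.bam", "sample2.bam"], "dedup_out")
```

The rest of a samtools-based pipeline can then read the flagged BAM as usual; whether duplicates should be physically removed or only flagged depends on what the downstream variant caller expects.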

written 6.3 years ago by dariober