Question: Samtools + Picard Markduplicates
gravatar for Nery
6.3 years ago by
Nery110 wrote:


I'm using Samtools to call variants and I am using Picard MarkDuplicates to mark duplicates in my bam file. My question is: Is it enough to just mark the duplicates using Picard and then use the marked bam file to call variants using Samtools or do the duplicates actually need to be removed from the bam file using the REMOVE_DUPLICATES=true argument in Picard? In other words, does Samtools pileup recognize the marked duplicates? Also, for calling the consensus sequence using Samtools, do I need to worry about duplicates or can I just use the original bam file without duplicates marked?

Thanks in advance for your help. (This group is great and I've learned a lot from it).

ADD COMMENTlink written 6.3 years ago by Nery110
gravatar for lh3
6.2 years ago by
United States
lh328k wrote:

It is safe to use MarkDuplicates+mpileup/pileup. By default, duplicates are ignored.

MarkDuplicates is "more correct" in the strict sense. Rmdup is more efficient simply because it does handle those tough cases. Rmdup works for single-end, too, but it cannot do paired-end and single-end at the same time. It does not work properly for mate-pair reads if read lengths are different.

As to SNP calling, the possibly best strategy is to do realignment with GATK, cap base quality by samtools, call SNPs with GATK and then recalibrate variant quality with GATK afterwards (i.e., you still need samtools for one step). In addition, for now, GATK is (arguably) less accurate than samtools for multi-sample low-coverage SNP calling, but the next version of UnifiedGenotyper is coming out which completely closes the gap because GATK and samtools will be using the same SNP calling model. As to variant quality recalibration, I am little cautious of its use of ts/tv. This is just a personal opinion, though, and perhaps biased.

ADD COMMENTlink written 6.2 years ago by lh328k
gravatar for Louis Letourneau
6.2 years ago by
Louis Letourneau800 wrote:

From my experience, the removal of duplicates does help with the consensus calling by removing the "artificial" coverage brought on by PCR dups.

Of course if you have a sample that has very deep coverage (>200x), then the duplicates might actually not be PCR duplicates...But that's a whole other set of problems.

As for rmdup vs MarkDuplciates:

"samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates."

I'm not sure rmdup works on single reads (non-paired). That's what it stated awhile ago in the manual.

ADD COMMENTlink written 6.2 years ago by Louis Letourneau800

The 1000 genomes procedure seems to be duplicate removal by default and they have low coverage. It would be good to really see a specific comparison...

ADD REPLYlink written 6.2 years ago by Allpowerde1.2k
gravatar for Allpowerde
6.3 years ago by
Allpowerde1.2k wrote:

Given that the new samtool function rmdup (Version: 0.1.11) removes the duplicates, it's reasonable to assume that removing the duplicates with Picard is a saver option than just labeling them when doing variant detection with samtools. But why are you using samtools rather than GATK ?

ADD COMMENTlink written 6.3 years ago by Allpowerde1.2k

why should we use GATK rather than samtools for this process ? is there a bug in samtools ?

ADD REPLYlink written 6.3 years ago by Pierre Lindenbaum89k

HI Pierre, I have not used samtools for SNP calling myself but as far as I know the user has to employ hard thresholds to classify true SNPs from sequencing error (coverage>X, strandbias<Y) with samtools, while GATK offers a gaussian mixture model, trained on external data (dbSNP, Array,...), which might be better. Do you have more infos about samtools and SNP calling?

ADD REPLYlink written 6.2 years ago by Allpowerde1.2k

Hi all I need some information regarding the removal of PCR duplicates. I want to make a pipeline for exome sequencing data analysis. So I am doing a test run combining several pipelines that are available in the different blogs and trying to optimize it. As of now am working with paired end reads and I have selected a sample from 1000 genome project and trying to see how it goes. I have just used a single sample. And till now I have proceeded to the step qucikly sort the bam files after I convert the sam to bam format for the sample. I am a bit confused what to do next? As far as I know we can directly index the sorted bam file and then work with the GATK packages for identifying the target regions for realignment and then do the various calls for INDEL and SNPs. But in some pipelines I see they do remove the PCR duplicates by marking it first. Should I use the MarkPCR duplicates and then remove the duplicates and then go for the GATK realigner and other stuffs or shall I directly proceed with the GATK steps? Also since I donot have any normal samples in my test case. Its a sample from the 1000 genomes project so it should be nice to tell me how to do the variant calling with GATK , should my vcf file be compared with the dbSNP vcf and other public vcf that are present ? Please reply it would be of great help.

ADD REPLYlink written 3.4 years ago by vchris_ngs2.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1726 users visited in the last hour