Question: Trouble with best practices
19 months ago, erarroji (0) wrote:


I am new to NGS tools. I tried to follow the best practices workflow for my exome reads from tumor samples: I aligned with bwa mem, sorted with samtools, marked duplicates and recalibrated base qualities with GATK 4, and finally called variants with Mutect2. Although the software ran successfully, I noticed that the MarkDuplicates metrics report only 0.3% duplicates, which I double-checked with samtools flagstat. I then marked duplicates with samtools on the same BAM, which gave 30% duplicates. Later, when I did the variant calling, the BAM marked with samtools yielded 297061 mutations against 50370 from the BAM marked with GATK. I am not sure which file is the right one. What could I be doing wrong? How can I check which file was processed correctly?

Thank you. Ernesto Rojas

written 19 months ago by erarroji (0)

One should never expect different variant calling pipelines to come up with the same number of variants, of course. Each program has its own idea about which QC thresholds are important. You have also not explained in detail the steps that you have taken (with code), so any comments here are going to be speculative.

As per Istvan, if you're new to this, then it is better to trust the GATK calls for now.

modified 17 months ago • written 19 months ago by Kevin Blighe (48k)

Thank you, I will trust GATK. The variant calling was done with the same tool (Mutect2) in both cases; I only changed the duplicate-marking tool. In all cases I used default values. Could that be what is causing the trouble?

written 19 months ago by erarroji (0)

Ah, so, you called variants with Mutect2 in both situations; however, in one pipeline you removed duplicates with samtools rmdup?

Edit: I would go by the BAM with duplicates marked by Picard MarkDuplicates.
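To see where the two Mutect2 call sets actually differ, rather than only comparing totals, one option is to key each record by (chromosome, position, REF, ALT) and look at the overlap. The following is a minimal sketch, not a validated comparison tool: the function names and the two tiny VCF snippets are made up for illustration, and it ignores FILTER status and variant normalization.

```python
def load_variants(vcf_text):
    """Return the set of (chrom, pos, ref, alt) keys found in VCF text."""
    variants = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):  # one key per ALT allele
            variants.add((chrom, int(pos), ref, alt))
    return variants


def compare_call_sets(a, b):
    """Count variants shared by both sets and unique to each."""
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}


HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"

# Hypothetical records, standing in for the two real Mutect2 outputs.
PICARD_VCF = HEADER + (
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\n"
    "chr2\t200\t.\tC\tT\t.\tPASS\t.\n"
)
SAMTOOLS_VCF = HEADER + (
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\n"
    "chr2\t200\t.\tC\tT\t.\tPASS\t.\n"
    "chr3\t300\t.\tG\tA,T\t.\tPASS\t.\n"
)

summary = compare_call_sets(load_variants(PICARD_VCF), load_variants(SAMTOOLS_VCF))
print(summary)
```

If nearly all of the smaller call set is contained in the larger one, the extra calls are the ones worth inspecting first.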

modified 17 months ago • written 19 months ago by Kevin Blighe (48k)

First, make sure that the numbers for duplicates are really off: 0.3 could mean 30% if it is expressed as a fraction.
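Despite its name, the PERCENT_DUPLICATION column in the Picard/GATK MarkDuplicates metrics file holds a fraction between 0 and 1. A small sketch of reading that value and converting it to a percentage is below; the function name is hypothetical, and the sample metrics text is made up and abbreviated to a few columns.

```python
def duplication_percent(metrics_text):
    """Read PERCENT_DUPLICATION (a fraction) and return it as a percentage."""
    lines = metrics_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            # The column header row and the first data row follow this marker.
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            row = dict(zip(header, values))
            return float(row["PERCENT_DUPLICATION"]) * 100.0
    raise ValueError("no metrics table found")


# Made-up, abbreviated metrics content for illustration only.
SAMPLE_METRICS = (
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
    "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION\n"
    "lib1\t1000\t300\t0.3\n"
)

percent = duplication_percent(SAMPLE_METRICS)
print(f"{percent:.1f}% duplicates")
```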

As for the mutations, I would trust the ones produced with GATK. It probably has more corrections built into it that remove more false positives.

Welcome to bioinformatics :-)

modified 19 months ago • written 19 months ago by Istvan Albert ♦♦ 81k

Thank you. I did check the duplicates: they were reported as a fraction, and I also verified them with samtools flagstat by doing the calculation from the raw counts.
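The calculation from raw flagstat counts can be sketched as follows. This assumes the usual plain-text layout of samtools flagstat ("N + M label", QC-passed plus QC-failed); the function name and the sample text are made up for illustration. The total counted here includes secondary, supplementary, and unmapped records, so the result may not match the MarkDuplicates metric exactly.

```python
import re

# Matches lines such as "300 + 0 duplicates".
COUNT_LINE = re.compile(r"^(\d+) \+ (\d+) (.*)$")


def duplicate_fraction(flagstat_text):
    """Return QC-passed duplicates divided by QC-passed total records."""
    total = dups = None
    for line in flagstat_text.splitlines():
        match = COUNT_LINE.match(line.strip())
        if not match:
            continue
        passed, label = int(match.group(1)), match.group(3)
        if label.startswith("in total"):
            total = passed
        elif label == "duplicates":  # exact match skips "primary duplicates"
            dups = passed
    if not total or dups is None:
        raise ValueError("could not find total and duplicate counts")
    return dups / total


# Made-up flagstat excerpt for illustration only.
SAMPLE_FLAGSTAT = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
3 + 0 duplicates
990 + 0 mapped (99.00% : N/A)
"""

fraction = duplicate_fraction(SAMPLE_FLAGSTAT)
print(f"fraction={fraction:.4f} percent={fraction * 100:.2f}%")
```

Printing both forms side by side makes it harder to confuse a fraction of 0.003 with 0.3%, or a fraction of 0.3 with 30%.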

written 19 months ago by erarroji (0)