Question: GATK MuTect2 duplication filter
4.2 years ago by
umn_bist (370) wrote:

So I have a set of tumor and matched normal samples. I have deduplicated them with Picard to handle PCR duplicates. Afterwards I use MuTect2 to call somatic variants against dbSNP and the COSMIC coding and noncoding mutation sets. For some reason, about 10% of my reads are being filtered out as duplicates.

I suspect that these "duplicates" are not contaminants and was wondering what may be going on. Could they be rRNA reads that were not removed during pre-processing QC?

rna-seq gatk • 3.3k views
4.2 years ago by
DG (7.1k) wrote:

Picard de-duplication is based almost entirely on reads mapping to identical start/end points. A 10% duplication rate is high, but it wouldn't be totally out of the ordinary either. You have your post tagged as RNA-Seq and mention rRNA, but your workflow with MuTect2 reads more like a DNA-Seq alignment and processing workflow. Are you calling variants from your RNA-Seq data? Given how Picard deduplicates data, it tends to grossly overestimate duplicated reads when dealing with RNA-Seq data, so that step is usually skipped. Some clarification about the source of your data and the type of experiment might clear things up further.
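To make the coordinate-based idea concrete, here is a minimal Python sketch. It is not Picard's actual implementation: the read records, field names, and the "keep the read with the highest base-quality sum" rule are simplifying assumptions for illustration only. Reads that share chromosome, start, end, and strand are grouped, one is kept, and the rest are flagged.

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Simplified illustration: reads sharing (chrom, start, end, strand)
    form one group; the read with the highest qual_sum is kept and the
    others are flagged. Field names are hypothetical.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates

# Made-up example reads: r1 and r2 share identical coordinates and strand.
reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "end": 200, "strand": "+", "qual_sum": 3000},
    {"name": "r2", "chrom": "chr1", "start": 100, "end": 200, "strand": "+", "qual_sum": 2800},
    {"name": "r3", "chrom": "chr1", "start": 100, "end": 210, "strand": "+", "qual_sum": 2900},
    {"name": "r4", "chrom": "chr1", "start": 100, "end": 200, "strand": "-", "qual_sum": 3100},
]

dups = mark_duplicates(reads)
print(sorted(dups))
```

In RNA-Seq, a highly expressed transcript yields many independent fragments over a short region, so unrelated reads can land on the same coordinates and be grouped together by this kind of rule, which is why the duplicate estimate gets inflated.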


Of course. I am working with RNA-seq data from TCGA and calling somatic variants with MuTect2. I have both tumors and their matched normals.

From my understanding, deduping is encouraged (according to Broad/GATK) to remove PCR duplicates. I am not removing the duplicates, simply marking them.

Could it be that the duplicates that are being filtered by MuTect2 are actually my marked duplicates?

written 4.2 years ago by umn_bist (370)

Unless you specify otherwise, MuTect2 automatically filters marked duplicates. The whole point of marking duplicates is that they aren't considered by downstream variant callers and other tools. The GATK/Broad best-practices documents are primarily geared towards working with DNA sequencing data, and many of the steps have not been validated for RNA. RNA-based variant calling has always been considered a little more problematic than DNA-based calling, as the underlying error rate is higher for individual nucleotides.
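As a rough illustration of what "marked" means here: in the SAM/BAM format, duplicate marking sets bit 0x400 (decimal 1024) in each read's FLAG field, and downstream tools can test that bit to skip the read. The Python sketch below checks the bit on a handful of made-up flag values; the helper names are hypothetical and this is not how MuTect2 itself is implemented.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate

def is_duplicate(flag):
    """True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)

def duplicate_fraction(flags):
    """Fraction of FLAG values with the duplicate bit set (0.0 if empty)."""
    flags = list(flags)
    if not flags:
        return 0.0
    return sum(is_duplicate(f) for f in flags) / len(flags)

# Made-up FLAG values: 99 and 147 are a properly paired read and its mate;
# 1123 and 1171 are the same two flags with the duplicate bit added.
example_flags = [99, 147, 1123, 1171]
fraction = duplicate_fraction(example_flags)
print(f"{fraction:.0%} of example reads are marked as duplicates")
```

If the share of reads carrying this bit after marking roughly matches the share the variant caller reports as filtered duplicates, the filtered reads are most likely the ones that were marked upstream.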

written 4.2 years ago by DG (7.1k)