Question

GATK MuTect2 duplication filter

0

9.3 years ago

umn_bist ▴ 390

So I have a set of tumor:matched normal samples. I have them deduped with Picard to deal with PCR duplicates. Afterwards I use MuTect2 to call somatic variants, supplying dbSNP plus the COSMIC coding and noncoding mutation sets. For some reason, about 10% of my reads are being filtered out as duplicates.

I suspect that these "duplicates" are not real PCR artifacts and was wondering what may be going on. Could they be rRNA reads that were not removed during pre-processing QC?

RNA-Seq GATK • 5.8k views

ADD COMMENT • link updated 9.3 years ago by DG 7.3k • written 9.3 years ago by umn_bist ▴ 390

score 1 · Answer 1 · 2016-03-24

1

9.3 years ago

DG 7.3k

Picard de-duplication is based almost entirely on reads mapping to identical start/end points. A 10% duplication rate is high, but it wouldn't be totally out of the ordinary either. You have your post tagged as RNA-Seq and mention rRNA, but your workflow with MuTect2 reads more like a DNA-Seq alignment and processing workflow. Are you calling variants from your RNA-Seq data? Because Picard identifies duplicates by position, it tends to grossly overestimate duplicated reads in RNA-Seq data, where highly expressed transcripts legitimately produce many independent fragments with identical coordinates, so the step is usually skipped. Some clarification about the source of your data and the type of experiment might clear things up further.
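To illustrate why position-based marking inflates duplicate counts for RNA-Seq, here is a minimal sketch of the idea. It is not Picard's actual implementation: the function name `mark_position_duplicates`, the dictionary layout of a read, and the "keep the highest quality sum" tie-break are simplifying assumptions made for illustration only.

```python
from collections import defaultdict


def mark_position_duplicates(reads):
    """Return the names of reads that a purely position-based rule would flag.

    `reads` is a list of dicts with the (hypothetical) keys:
    name, chrom, start, end, strand, qual_sum.
    Reads sharing chromosome, start, end and strand form one group; the read
    with the highest summed base quality is kept, the rest are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


# Five independent fragments from one highly expressed transcript that happen
# to share coordinates, plus one read elsewhere in the genome.
example_reads = [
    {"name": f"frag{i}", "chrom": "chr1", "start": 1000, "end": 1150,
     "strand": "+", "qual_sum": 3000 + i}
    for i in range(5)
]
example_reads.append(
    {"name": "other", "chrom": "chr2", "start": 5000, "end": 5150,
     "strand": "-", "qual_sum": 2900}
)

flagged = mark_position_duplicates(example_reads)
print(sorted(flagged))
```

All but one of the co-located fragments get flagged even though none of them need be a PCR copy, which is the overestimation described above.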

ADD COMMENT • link 9.3 years ago by DG 7.3k

0

Of course. I am working with RNA-seq data from TCGA for somatic variant calling using MuTect2. I have both the tumors and their matched normals.

From my understanding, deduping was encouraged (according to Broad/GATK) to deal with PCR duplicates. I am not removing my duplicates but simply marking them.

Could it be that the duplicates that are being filtered by MuTect2 are actually my marked duplicates?

ADD REPLY • link 9.3 years ago by umn_bist ▴ 390

0

Unless you specify otherwise, MuTect2 automatically filters marked duplicates. The whole point of marking duplicates is that they aren't considered by downstream variant callers and other tools. The GATK/Broad best-practices documents are primarily geared towards working with DNA sequencing data, and many of the steps have not been validated for RNA. RNA-based variant calling has always been considered a little more problematic than DNA-based calling, as the underlying error rate is higher for individual nucleotides.
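One way to check whether the reads MuTect2 reports as filtered duplicates are simply your marked duplicates is to count how many alignments carry the SAM duplicate flag (bit 0x400) and compare that fraction with the roughly 10% you are seeing. A minimal sketch, assuming plain-text SAM records as input; the function name `duplicate_fraction` and the example records are made up for illustration.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit meaning "PCR or optical duplicate"


def duplicate_fraction(sam_lines):
    """Return the fraction of alignment records with the duplicate bit set.

    `sam_lines` is any iterable of SAM text lines; header lines (starting
    with '@') and blank lines are skipped.
    """
    total = 0
    duplicates = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        total += 1
        if flag & DUPLICATE_FLAG:
            duplicates += 1
    return duplicates / total if total else 0.0


# Tiny made-up example: one normal read (flag 99) and one marked duplicate
# (flag 1123 = 99 + 1024).
example_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",
    "r2\t1123\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",
]
print(duplicate_fraction(example_sam))
```

On a real file the lines could be piped in from `samtools view`, or the flagged reads counted directly with `samtools view -c -f 1024` on the BAM. If that fraction matches what MuTect2 reports, the filtered reads are the ones marked upstream.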

ADD REPLY • link 9.3 years ago by DG 7.3k