Question: GATK Terra Best Practices pipeline for Mito Variant Calling
1
gravatar for a.j.wilson0000
9 months ago by
Johns Hopkins, Baltimore, USA
a.j.wilson000010 wrote:

Hey all.

Currently working on a project to do mitochondrial variant calling on whole exome data. Our probes are about 120 bp and are setup to capture the entirety of the chrM contig at extremely high depth.

We're currently looking at a few different tools, and the new GATK best practices MUTECT2 mito pipeline that incorporates a double alignment strategy looks very promising. The thing is, the --mitochondria-mode tag is brand spanking new, and there just isn't a lot of documentation or usage examples for replicating the pipeline at the command line. I've tried contacting support a few times, but GATK support is quite understaffed at the moment.

Thinking instead that we might use their fully built TERRA cloud pipeline (found here) to generate our vcfs, but the pipeline docs indicate that the workflow is configured for full WGS bam/crams, not WES bam/crams.

People who are familiar with TERRA and variant calling... do you think it is possible to do WES on this workflow? What required inputs would need to change? I'm guessing a few of the interval lists?

Also, there are quite a few optional inputs you can use. Can anyone suggest some I might want to use besides setting a vaf_filter_threshold?

ADD COMMENTlink modified 8 months ago by houkto80 • written 9 months ago by a.j.wilson000010

Before suggesting anything else, did you check if your data have a solid number of alignments to chrM? Typical genomic DNA extraction kits do capture circular mtDNA, neither do typical exome kits capture it out of the DNA pool. Samtools idxstats would be a fast way to check.

ADD REPLYlink modified 9 months ago • written 9 months ago by ATpoint36k
2
gravatar for Jeremy Leipzig
9 months ago by
Philadelphia, PA
Jeremy Leipzig19k wrote:

Currently working on a project to do mitochondrial variant calling on whole exome data. Our probes are about 120 bp and are setup to capture the entirety of the chrM contig at extremely high depth.

As I understand it a typical exome kit doesn't include mtDNA probes because the kit would be overwhelmed by the number of mtdna compared to nuclear dna and any attempt to include probes can also end up amplifying NUMTs. Maybe you have a better solution or more specific probes.

Thinking instead that we might use their fully built TERRA cloud pipeline (found here) to generate our vcfs, but the pipeline docs indicate that the workflow is configured for full WGS bam/crams, not WES bam/crams.

WGS doesn't suffer as much from the problems I mentioned above (though NUMTs are still a thing), so they just figured you would have done it this way. This first step in this TERRA WDL workflow just filters for chrM anyway. It's about as crude as you'd imagine and would behave the same for any bam file.

  task SubsetBamToChrM {
  input {
    File input_bam
    File input_bai
    String contig_name
    String basename = basename(basename(input_bam, ".cram"), ".bam")
    File? ref_fasta
    File? ref_fasta_index
    File? ref_dict

    File? gatk_override

    # runtime
    Int? preemptible_tries
  }
  Float ref_size = if defined(ref_fasta) then size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB") else 0
  Int disk_size = ceil(size(input_bam, "GB") + ref_size) + 20

  meta {
    description: "Subsets a whole genome bam to just Mitochondria reads"
  }
  parameter_meta {
    ref_fasta: "Reference is only required for cram input. If it is provided ref_fasta_index and ref_dict are also required."
    input_bam: {
      localization_optional: true
    }
    input_bai: {
      localization_optional: true
    }
  }
  command <<<
    set -e
    export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk_override}

    gatk PrintReads \
      ~{"-R " + ref_fasta} \
      -L ~{contig_name} \
      --read-filter MateOnSameContigOrNoMappedMateReadFilter \
      --read-filter MateUnmappedAndUnmappedReadFilter \
      -I ~{input_bam} \
      -O ~{basename}.bam
  >>>
  runtime {
    memory: "3 GB"
    disks: "local-disk " + disk_size + " HDD"
    docker: "us.gcr.io/broad-gatk/gatk:4.1.1.0"
    preemptible: select_first([preemptible_tries, 5])
  }
  output {
    File output_bam = "~{basename}.bam"
    File output_bai = "~{basename}.bai"
  }
}
ADD COMMENTlink modified 9 months ago • written 9 months ago by Jeremy Leipzig19k

I'm fairly certain it's all been sorted on the lab side. We have new TWIST mito probes custom tailored to the rCRF that can pull out chrM piecemeal without disrupting the standard WES process.

Auto filtering of amplified NuMTs is apparently done with FilterMutectCalls along with the default settings for the --mitochondrial-mode tag.

ADD REPLYlink written 9 months ago by a.j.wilson000010
1
gravatar for houkto
8 months ago by
houkto80
UK
houkto80 wrote:

I worked on regular WES kits from agilent (v5) and illumine (nextera) both were able to capture the mitochondria DNA. We also QC the results with Sanger sequencing. I do suggest you stick with WDL pipeline as it does answers a lot of the reviewers question if you want to publish your mitochondria work i.e Numt, haploid & circler DNA, heteroplasmy and false positive variants. These issues are challenged with higher criteria for mapping reads to regular and shifted mitochondria reference genomes (after regular mapping to the whole genome/exome) to adjust reads for circular DNA and to avoid numt, contamination which is also done by taking into account the WES/WGS autosomal coverage into an account when calling mitochondria DNA variants. Sensitive detection of heteroplasmic variants while blacklisting known artifacts variants when calling mitochondria DNA variants. the disadvantage of this caller is it doesn't allow for multisamples calling :/

ADD COMMENTlink written 8 months ago by houkto80

hi houkto, I am also going to work on WES mito variant calling. I read about the forum and wdl of GATK best practice. While I am confusing about the shifted chrM fasta, do I need to generate it by myself? For I could not just download from their web because the path is wrong when I used the one in json file. Also their pipeline is based on hg38, and I use hg19. I think this doesn't matter because we just focus on chrM fasta, right? And should I use rCRS fasta as ref?

ADD REPLYlink written 3 months ago by lovelysunzt0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1644 users visited in the last hour