Question

GATK Terra Best Practices pipeline for Mito Variant Calling

1

Entering edit mode

4.6 years ago

a.j.wilson0000 ▴ 10

Hey all.

Currently working on a project to do mitochondrial variant calling on whole exome data. Our probes are about 120 bp and are setup to capture the entirety of the chrM contig at extremely high depth.

We're currently looking at a few different tools, and the new GATK best practices MUTECT2 mito pipeline that incorporates a double alignment strategy looks very promising. The thing is, the --mitochondria-mode tag is brand spanking new, and there just isn't a lot of documentation or usage examples for replicating the pipeline at the command line. I've tried contacting support a few times, but GATK support is quite understaffed at the moment.

Thinking instead that we might use their fully built TERRA cloud pipeline (found here) to generate our vcfs, but the pipeline docs indicate that the workflow is configured for full WGS bam/crams, not WES bam/crams.

People who are familiar with TERRA and variant calling... do you think it is possible to do WES on this workflow? What required inputs would need to change? I'm guessing a few of the interval lists?

Also, there are quite a few optional inputs you can use. Can anyone suggest some I might want to use besides setting a vaf_filter_threshold?

gatk terra WES variant calling • 2.4k views

ADD COMMENT • link updated 4.5 years ago by houkto ▴ 80 • written 4.6 years ago by a.j.wilson0000 ▴ 10

0

Entering edit mode

Before suggesting anything else, did you check if your data have a solid number of alignments to chrM? Typical genomic DNA extraction kits do capture circular mtDNA, neither do typical exome kits capture it out of the DNA pool. Samtools idxstats would be a fast way to check.

ADD REPLY • link 4.6 years ago by ATpoint 82k

1

Entering edit mode

4.5 years ago

houkto ▴ 80

I worked on regular WES kits from agilent (v5) and illumine (nextera) both were able to capture the mitochondria DNA. We also QC the results with Sanger sequencing. I do suggest you stick with WDL pipeline as it does answers a lot of the reviewers question if you want to publish your mitochondria work i.e Numt, haploid & circler DNA, heteroplasmy and false positive variants. These issues are challenged with higher criteria for mapping reads to regular and shifted mitochondria reference genomes (after regular mapping to the whole genome/exome) to adjust reads for circular DNA and to avoid numt, contamination which is also done by taking into account the WES/WGS autosomal coverage into an account when calling mitochondria DNA variants. Sensitive detection of heteroplasmic variants while blacklisting known artifacts variants when calling mitochondria DNA variants. the disadvantage of this caller is it doesn't allow for multisamples calling :/

ADD COMMENT • link 4.5 years ago by houkto ▴ 80

0

Entering edit mode

hi houkto, I am also going to work on WES mito variant calling. I read about the forum and wdl of GATK best practice. While I am confusing about the shifted chrM fasta, do I need to generate it by myself? For I could not just download from their web because the path is wrong when I used the one in json file. Also their pipeline is based on hg38, and I use hg19. I think this doesn't matter because we just focus on chrM fasta, right? And should I use rCRS fasta as ref?

ADD REPLY • link 4.1 years ago by lovelysunzt • 0

score 2 · Accepted Answer · 2019-09-20

Currently working on a project to do mitochondrial variant calling on whole exome data. Our probes are about 120 bp and are setup to capture the entirety of the chrM contig at extremely high depth.

As I understand it a typical exome kit doesn't include mtDNA probes because the kit would be overwhelmed by the number of mtdna compared to nuclear dna and any attempt to include probes can also end up amplifying NUMTs. Maybe you have a better solution or more specific probes.

Thinking instead that we might use their fully built TERRA cloud pipeline (found here) to generate our vcfs, but the pipeline docs indicate that the workflow is configured for full WGS bam/crams, not WES bam/crams.

WGS doesn't suffer as much from the problems I mentioned above (though NUMTs are still a thing), so they just figured you would have done it this way. This first step in this TERRA WDL workflow just filters for chrM anyway. It's about as crude as you'd imagine and would behave the same for any bam file.

  task SubsetBamToChrM {
  input {
    File input_bam
    File input_bai
    String contig_name
    String basename = basename(basename(input_bam, ".cram"), ".bam")
    File? ref_fasta
    File? ref_fasta_index
    File? ref_dict

    File? gatk_override

    # runtime
    Int? preemptible_tries
  }
  Float ref_size = if defined(ref_fasta) then size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB") else 0
  Int disk_size = ceil(size(input_bam, "GB") + ref_size) + 20

  meta {
    description: "Subsets a whole genome bam to just Mitochondria reads"
  }
  parameter_meta {
    ref_fasta: "Reference is only required for cram input. If it is provided ref_fasta_index and ref_dict are also required."
    input_bam: {
      localization_optional: true
    }
    input_bai: {
      localization_optional: true
    }
  }
  command <<<
    set -e
    export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk_override}

    gatk PrintReads \
      ~{"-R " + ref_fasta} \
      -L ~{contig_name} \
      --read-filter MateOnSameContigOrNoMappedMateReadFilter \
      --read-filter MateUnmappedAndUnmappedReadFilter \
      -I ~{input_bam} \
      -O ~{basename}.bam
  >>>
  runtime {
    memory: "3 GB"
    disks: "local-disk " + disk_size + " HDD"
    docker: "us.gcr.io/broad-gatk/gatk:4.1.1.0"
    preemptible: select_first([preemptible_tries, 5])
  }
  output {
    File output_bam = "~{basename}.bam"
    File output_bai = "~{basename}.bai"
  }
}