Question

Tranche Plot after VSQR

0

Entering edit mode

4.7 years ago

BAGeno ▴ 190

Hi

I am doing exome sequencing of 15 patients. After variant calling I run VariantRecalibrator using this command

gatk VariantRecalibrator -R new_hg38.fa \
  -V allsample.vcf \
 --resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.vcf.gz \
--resource omni,known=false,training=true,truth=false,prior=12.0:omni2.5.hg38.vcf.gz \
--resource 1000G,known=false,training=true,truth=false,prior=10.0:thouzndG_phase1.snps.high_confidence.hg38.vcf.gz  \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:All_20180418.chr.hg38.vcf.gz \
-an QD \
-an MQ \
-an MQRankSum \
-an ReadPosRankSum \
-an FS \
-an SOR \
-mode SNP \
-O output.snp.recal  \
--tranches-file output.snp.tranches \
--rscript-file output.snp..plots.R

Can you please tell me whether this tranche graph is accurate or not?

gatk4 VariantReclibration • 2.6k views

ADD COMMENT • link updated 4.7 years ago by Nicolas Rosewick 11k • written 4.7 years ago by BAGeno ▴ 190

0

Entering edit mode

Could you add GATK version you used. And also the previous steps you did. Did you used an interval file ? Tranche plot is not very good IMO.

ADD REPLY • link 4.7 years ago by Nicolas Rosewick 11k

0

Entering edit mode

I have used GATK 4.0.8. I run HaplotypeCaller, CombineGVCFs, and GenotypeGVCFs. I did not use interval file. So what should I do for this?

ADD REPLY • link 4.7 years ago by BAGeno ▴ 190

score 4 · Answer 1 · 2019-08-23

For exome sequencing datasets you should always use an interval file. As explained on GATK website (https://software.broadinstitute.org/gatk/documentation/article?id=11009-

For exomes and similarly targeted data types, the interval list should correspond to the capture targets used for the library prep, and is typically provided by the prep kit manufacturer (with versions for each ref genome build of course).

Thus you should at each step where an -L option is available use the corresponding interval file. The exome kit vendor should provide you with the corresponding interval file. If not you can use gencode :

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz
zcat gencode.v31.annotation.gtf.gz | awk '$3 == "exon" { print $1, $4, $5, $7, $18}' | tr ' ' '\t' | tr -d '";' >> interval_list.bed

For information if you use GATK4 best practices these are the tools where the interval file needs to be specified

BaseRecalibrator
ApplyBQSR
HaplotypeCaller
GenomicsDBimport (as you have only 15 samples you can use directly GenotypeGVCFs)
GenotypeGVCFs
VariantRecalibrator
ApplyVQSR

Also as you have a limited number of samples you should add 1000G samples (from here : https://software.broadinstitute.org/gatk/documentation/article.php?id=1259)

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs.

You can also try using the VQSR with the smaller variant callset, but experiment with argument settings (try adding --maxGaussians 4 to your command line, for example). You should only do this if you are working with a non-model organism for which there are no available genomes or exomes that you can use to supplement your own cohort.