Question

GATK4 - To be known or not to be known, that is the question.

2.2 years ago

K.patel5 ▴ 140

Dear Biostars,

This question is specifically about best practice for using GATK4 to call SNPs and indels in human samples.

I am fairly new to genomics and I think I am misunderstanding how GATK treats "known" sites differently in the BQSR and VQSR steps.

During BQSR (BaseRecalibrator) I believe the standard --known-sites used for human analysis are: dbSNP, Homo_sapiens_assembly38.known_indels, and Mills_and_1000G_gold_standard.indels. In a few posts online I also saw 1000G_phase1.snps.high_confidence.hg38 and 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38 being used.

However, when it comes to VQSR (VariantRecalibrator), the --resource flag changes some of these. For example, in the standard example given on the Broad website (https://gatk.broadinstitute.org/hc/en-us/articles/360036351392-VariantRecalibrator), 1000G_phase1.snps.high_confidence.hg38 is now known=false, and dbSNP is the only dataset given the known=true flag.

I am confused as to why the BQSR and VQSR steps can differ in which resources are classed as known. Is there a reference online that states which of the GATK resource bundle datasets should be used as known?

The GATK forum had a similar question, but I did not find the query to be answered decisively. (https://gatk.broadinstitute.org/hc/en-us/articles/360035890831-Known-variants-Training-resources-Truth-sets).

Any insight from more experienced bioinformaticians on which datasets are most appropriate as known sites would be a big help.

Cheers

WGS WES flags GATK

Updated 2.2 years ago by Santosh Anand 5.7k • written 2.2 years ago by K.patel5 ▴ 140

score 4 · Accepted Answer · 2022-01-25

First of all, thank you for a very clear question and for the effort you put into formatting it nicely.

If you read the description of known on the VQSR page (https://gatk.broadinstitute.org/hc/en-us/articles/360036351392-VariantRecalibrator), it says:

--resource / -resource Known - The program only uses known sites for reporting purposes (to indicate whether variants are already known or novel)

So, known sites are used only to flag whether a called variant has already been reported elsewhere (known) or not (novel). For this purpose dbSNP is the natural choice, since it collates SNP information from the other sources. See these:

https://www.internationalgenome.org/faq/are-the-igsr-variants-available-in-dbsnp/

Which One Should I Use Hapmap Or 1000Genome Or Dbsnp?
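
To make the known/novel distinction concrete, here is a minimal shell sketch. It assumes the ID column of your callset was filled in from dbSNP (for example by supplying a dbSNP file at calling time), so a record whose ID is "." counts as novel. The three records and their rsIDs are made up for illustration. This is not how VariantRecalibrator builds its own report, only the same idea in miniature.

```shell
# Build a tiny, made-up VCF; a real run would point at your own callset.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t1000\trs0000001\tA\tG\t50\tPASS\t.\n'
  printf 'chr1\t2000\t.\tC\tT\t50\tPASS\t.\n'
  printf 'chr1\t3000\trs0000002\tG\tA\t50\tPASS\t.\n'
} > "$vcf"

# Skip header lines. Column 3 is the VCF ID field:
# "." means no ID was assigned, which we treat as novel here.
summary=$(awk -F'\t' '
  !/^#/ { if ($3 == ".") novel++; else known++ }
  END   { printf "known=%d novel=%d", known, novel }
' "$vcf")

echo "$summary"
rm -f "$vcf"
```

Nothing about the variant calls themselves changes here; "known" is purely a label used when summarising the callset.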

Other SNP resources can be used as 'training' and 'truth' sets, as explained in the second link that you posted:

A training set resource is a list of variants that is used by machine-learning based algorithms to model the properties of true variation vs. artifacts. This requires a higher standard of curation and validation of the variants that are included in the resource. Tools that take such a resource typically accept a parameter that indicates your degree of confidence in the resource. This type of resource is difficult to bootstrap, as it benefits greatly from orthogonal validation (e.g. through a different technology such as arrays or Sanger sequencing).

A truth set resource is a list of variants that is used to evaluate the quality of a variant callset (e.g. sensitivity and specificity, or recall). As such this requires the highest standard of validation, and tools that take such a resource will assume all variant calls it contains are true variation. This cannot be bootstrapped and must be generated using orthogonal validation methods.

As you see, the training set requires a higher degree of confidence and the truth set requires the highest degree of confidence - so you can choose them according to your confidence level.

For example, when recalibrating exome SNP data, different resources are used as training and truth sets (see the VariantRecalibrator link above):

--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.sites.vcf.gz \
--resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg38.sites.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf.gz \

Using dbSNP has another advantage: it has also collected SNPs from other species, so you can have a single 'known' database for SNPs across species.
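
For context, the --resource lines above could sit inside a full call roughly like this. It is a sketch modelled on the example in the VariantRecalibrator documentation: the reference and input file names and the -an annotation list are assumptions to adapt to your own data, and the command is only printed (a dry run), not executed.

```shell
# Hypothetical paths; replace with your own reference and callset.
ref=Homo_sapiens_assembly38.fasta
input_vcf=input.vcf.gz

# Collect the arguments as positional parameters so they can be
# printed for inspection or passed on unchanged.
set -- VariantRecalibrator \
  -R "$ref" \
  -V "$input_vcf" \
  --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.sites.vcf.gz \
  --resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg38.sites.vcf.gz \
  --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
  --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
  -mode SNP \
  -O output.recal \
  --tranches-file output.tranches

# Dry run: show the command instead of running it.
vqsr_cmd="gatk $*"
echo "$vqsr_cmd"

# To run it for real (requires GATK4 on your PATH): gatk "$@"
```

Note that dbSNP is the only resource with known=true, and it is neither training nor truth: it labels variants but does not shape the model.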

BQSR

The --known-sites argument is used a little differently in BQSR (https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator#--known-sites):

--known-sites / NA One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis. This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites is given to the tool in order to skip over those sites. This tool accepts any number of Feature-containing files (VCF, BCF, BED, etc.) for use as this database. For users wishing to exclude an interval list of known variation simply use -XL my.interval.list to skip over processing those sites. Please note however that the statistics reported by the tool will not accurately be reflected those sites skipped by the -XL argument.

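So here you would supply any and all sites that could plausibly be real variants, SNPs and indels alike. That is why BQSR is typically given dbSNP together with the known-indel sets, whereas in VQSR only dbSNP is marked known=true.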
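
As a rough illustration, a BaseRecalibrator call with several known-sites files might be assembled like this. The BAM, reference, and output names are hypothetical, and the exact resource file names are assumed from the datasets named in the question, so check them against your copy of the resource bundle. As written, the command is only printed (a dry run).

```shell
# Hypothetical inputs; replace with your own files.
bam=sample.bam
ref=Homo_sapiens_assembly38.fasta

# Every file listed here is masked out of the error model,
# so it is fine (and usual) to list several.
known_sites="Homo_sapiens_assembly38.dbsnp138.vcf.gz
Homo_sapiens_assembly38.known_indels.vcf.gz
Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"

set -- BaseRecalibrator -I "$bam" -R "$ref"
for sites in $known_sites; do
  # --known-sites may be repeated, once per resource file.
  set -- "$@" --known-sites "$sites"
done
set -- "$@" -O recal_data.table

# Dry run: show the command instead of running it.
bqsr_cmd="gatk $*"
echo "$bqsr_cmd"

# To run it for real (requires GATK4 on your PATH): gatk "$@"
```

Unlike the VQSR resources, these files carry no known/training/truth labels: BQSR only needs to know which positions to skip.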