Question

GATK Variant Quality Score Recalibration: How To Work With Custom Truth And Training Datasets?

10.9 years ago

William ★ 5.3k

I have 10 strains of the same non-human species and I want to use VQSR on the raw single-sample SNP calls. A subset of the 10 strains has been genotyped using a SNP array. How can I use that SNP array as a truth dataset? The SNP array data is in some kind of CSV format. Can I just convert that format to BED, extract those positions from the single-sample VCF into a truth VCF, and supply that as a truth dataset to VQSR?
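To make the idea concrete, here is a minimal sketch of the CSV to BED to truth-VCF route in plain Python. The column names `chrom` and `pos`, the 1-based array coordinates, and the helper names are assumptions for illustration; the real array export will likely need its own parsing.

```python
import csv
import io


def array_csv_to_bed(csv_text, chrom_col="chrom", pos_col="pos"):
    """Turn array rows into 0-based, half-open BED lines (chrom, start, end).

    Assumes the array reports 1-based positions, as VCF does.
    """
    bed = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        pos = int(row[pos_col])
        bed.append(f"{row[chrom_col]}\t{pos - 1}\t{pos}")
    return bed


def extract_sites(vcf_lines, bed_lines):
    """Keep VCF header lines plus records whose CHROM/POS is in the BED list."""
    wanted = set()
    for line in bed_lines:
        chrom, _start, end = line.split("\t")
        wanted.add((chrom, int(end)))  # BED end equals the 1-based POS of a SNP
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.split("\t")
        if (fields[0], int(fields[1])) in wanted:
            kept.append(line)
    return kept
```

This only matches on position. It does not check that the called allele agrees with the array genotype, which is probably worth doing before treating a site as truth.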

Can I use the same dataset as a training dataset, or does the training set need to be a different (bigger and/or non-overlapping?) dataset than the truth dataset? Or can I, for example, just take all the high-quality calls (quality above 100) from the single-sample raw SNP calls and supply this as a training set?
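For the second option, the quality filter itself is simple. A rough sketch, assuming the QUAL column (the sixth VCF field) is what "quality" means here; the function name and the threshold of 100 are just the example values from above, not a recommendation:

```python
def high_quality_records(vcf_lines, min_qual=100.0):
    """Keep header lines and records whose QUAL is strictly above min_qual."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        qual = line.split("\t")[5]
        # A missing QUAL is written as "." in VCF; skip such records.
        if qual != "." and float(qual) > min_qual:
            kept.append(line)
    return kept
```

Whether a set selected this way is a sound training set for VQSR is exactly the open question; the code only shows the selection step.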

I also have reference calls in my raw single-sample VCFs. Do I need to manually exclude them from the VQSR process?
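If they do need to be excluded by hand, something like the following would do it. This is a sketch that assumes reference calls are written with a missing ALT allele (`.` in the fifth VCF column); the function name is made up for illustration.

```python
def drop_reference_calls(vcf_lines):
    """Keep header lines and records that carry at least one ALT allele."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        alt = line.split("\t")[4]
        # Assumption: a reference call has ALT set to "." (no alternate allele).
        if alt not in (".", ""):
            kept.append(line)
    return kept
```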

I also posted the question on http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr

More background info:

VariantRecalibrator: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html

What VQSR training sets / arguments should I use for my specific project? http://www.broadinstitute.org/gatk/guide/article?id=1259

gatk • 4.2k views

Nobody? I can't be the first person trying to use GATK VQSR without existing training data available?

ADD REPLY • link 10.9 years ago by William ★ 5.3k

It would make sense to use your SNP array data as truth. As for using it with VQSR, I assume it would be relatively straightforward to convert your SNP data into a VCF-acceptable format.
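As a rough illustration of that conversion, the sketch below writes a minimal sites-only VCF from array rows. The CSV columns (`snp_id`, `chrom`, `pos`, `ref`, `alt`) are assumptions, since the actual array format is not known, and the function name is hypothetical.

```python
import csv
import io


def array_csv_to_vcf(csv_text):
    """Build minimal sites-only VCF lines from array rows (assumed columns)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Sort by contig name and position; a real pipeline may instead need the
    # contig order of the reference genome.
    rows.sort(key=lambda r: (r["chrom"], int(r["pos"])))
    lines = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for r in rows:
        lines.append("\t".join([
            r["chrom"], r["pos"], r["snp_id"] or ".",
            r["ref"], r["alt"],
            ".", ".", ".",  # QUAL, FILTER and INFO left as missing
        ]))
    return lines
```

The REF allele from the array would still need to be checked against the reference genome (including strand) before the file is used as a resource.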

ADD REPLY • link 10.5 years ago by lavinia.gordon • 0