I have 10 strains of the same non-human species and I want to use VQSR on the raw single-sample SNP calls. A subset of the 10 strains has been genotyped using a SNP array. How can I use that SNP array as a truth dataset? The SNP array data is in some kind of CSV format. Can I just convert that format to BED, extract those positions from the single-sample VCF into a truth VCF, and supply that as the truth dataset to VQSR?
Can I use the same dataset as the training dataset, or does the training set need to be a different (bigger and/or non-overlapping?) dataset than the truth dataset? Or can I, for example, take all the high-quality calls (quality above 100) from the single-sample raw SNP calls and supply these as a training set?
I also have reference calls in my raw single-sample VCFs. Do I need to manually exclude them from the VQSR process?
I also posted the question on http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr
More background info:
VariantRecalibrator: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html
What VQSR training sets / arguments should I use for my specific project?: http://www.broadinstitute.org/gatk/guide/article?id=1259
Nobody? I can't be the first person trying to use GATK VQSR without existing training data available?
It would make sense to use your SNP array data as truth. As for using it with VQSR, I assume it would be relatively straightforward to convert your SNP data into a VCF-acceptable format.
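To make the conversion idea concrete, here is a minimal Python sketch that turns array rows into a sites-only VCF. It is only an illustration under several assumptions: the CSV column names (`chrom`, `pos`, `id`, `ref`, `alt`) and the function name `array_csv_to_vcf` are hypothetical, positions are assumed to be 1-based, and alleles are assumed to be reported on the forward strand of the same reference you aligned against. Your array export will probably need its own column mapping and possibly strand flipping.

```python
import csv
import io

BASES = {"A", "C", "G", "T"}

def array_csv_to_vcf(csv_text):
    """Convert SNP-array rows to a sites-only VCF string.

    Assumed (hypothetical) CSV columns: chrom, pos, id, ref, alt.
    Positions are assumed 1-based, alleles on the reference forward strand.
    """
    sites = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        ref = row["ref"].strip().upper()
        alt = row["alt"].strip().upper()
        # Keep only clean biallelic SNPs; skip no-calls and monomorphic probes.
        if ref in BASES and alt in BASES and ref != alt:
            sites.append((row["chrom"], int(row["pos"]), row["id"] or ".", ref, alt))

    # Plain lexicographic contig sort. Caveat: the contig order may need to
    # match your reference's sequence dictionary, so adjust this if needed.
    sites.sort(key=lambda s: (s[0], s[1]))

    lines = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for chrom, pos, snp_id, ref, alt in sites:
        # QUAL, FILTER and INFO are left as missing values (".").
        lines.append(f"{chrom}\t{pos}\t{snp_id}\t{ref}\t{alt}\t.\t.\t.")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = (
        "chrom,pos,id,ref,alt\n"
        "chr2,500,snp2,G,A\n"
        "chr1,100,snp1,A,T\n"
    )
    print(array_csv_to_vcf(example), end="")
```

The sketch only writes site records (no genotype columns), and it does not check that the `ref` allele actually matches the reference base at that position, which would be worth verifying before using the file as a resource.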