Question

GATK Variant Discovery And Filtering Process For 500K Target Regions

11.8 years ago

Nino ▴ 20

I am trying to apply the GATK pipeline to my NGS experiment (500K for each sample), and I don't understand exactly what to do after I obtain the recalibrated, analysis-ready BAM files.

1) Should I run UnifiedGenotyper separately for SNPs and indels? What about structural variation?

2) The Best Practices for Variant Discovery v3 page on the GATK wiki explains that for small projects I cannot use the VQSR module and should use the older VariantFiltration walker instead. Any suggestions on parameter settings? If I run UnifiedGenotyper separately for SNPs and indels, are the parameters different? And how do I merge the resulting variant files into a single VCF file?

Thank you.

gatk variant • 3.5k views

updated 11.8 years ago by Jorge Amigo 14k • written 11.8 years ago by Nino ▴ 20

score 0 · Answer 1 · 2012-07-25

GATK's UnifiedGenotyper walker can handle SNPs and indels at the same time: all you have to do is use the option "-glm BOTH", as described on its manual page. Discovering CNVs is far more complicated and is not usually done without sequencing the entire genome. (I say "usually" because you can try inferring structural variation by analysing several samples together and studying the differences in their sequencing. Some literature is coming out on this, but I can't tell you whether it works or not because we haven't tried it yet.)
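To make the single-pass call concrete, here is a minimal Python sketch that only assembles the command line; it does not run GATK. The `-T`, `-R`, `-I`, `-glm`, `-o` and `-L` options are the ones I know from the GATK 1.x to 3.x command line, so check them against the manual page for your version. The jar name, file names and the helper function itself are made up for illustration.

```python
import shlex


def unified_genotyper_cmd(reference, bam, out_vcf, intervals=None,
                          gatk_jar="GenomeAnalysisTK.jar"):
    """Build (but do not run) a UnifiedGenotyper command calling SNPs and indels together.

    All paths are placeholders; gatk_jar is an assumed location of the GATK jar.
    """
    cmd = [
        "java", "-jar", gatk_jar,
        "-T", "UnifiedGenotyper",
        "-R", reference,       # reference FASTA
        "-I", bam,             # recalibrated, analysis-ready BAM
        "-glm", "BOTH",        # genotype likelihoods model: SNPs and indels in one pass
        "-o", out_vcf,
    ]
    if intervals:
        # Restrict calling to the target regions of a targeted experiment.
        cmd += ["-L", intervals]
    return cmd


if __name__ == "__main__":
    cmd = unified_genotyper_cmd("ref.fasta", "sample.recal.bam", "sample.raw.vcf",
                                intervals="targets.interval_list")
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Since SNPs and indels come out in the same VCF this way, there is nothing to merge afterwards.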

Although you can try tuning it for a low number of variants, the VQSR module works fine when dealing with several thousand variants per sample, as in exome or whole-genome sequencing. The underlying idea is to check how the variants detected in your experiment behave relative to a very well-known source of variation (such as HapMap variants): it builds a statistical model from the variants that match that reference dataset and applies the result to the rest of the variants. That's why you need lots of variants: you need a significant overlap between your experiment and your reference dataset.

VariantFiltration does not work that way at all: it does not determine whether your variants are trustworthy or not, it just lets you filter large numbers of variants that match certain criteria. If you can't use VQSR, you may want to try looking at Ti/Tv (transition/transversion) ratios or something similar.
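As a rough illustration of the transition/transversion (Ti/Tv) check, here is a small Python sketch that counts both classes over the biallelic SNPs of a VCF. It assumes tab-separated VCF lines with REF in column 4 and ALT in column 5, and it simply skips indels and multi-allelic sites; the function names and example records are made up. Whatever ratio you expect depends on the regions you targeted, so use it to compare call sets (for example before and after filtering) rather than as an absolute pass/fail number.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def count_ti_tv(vcf_lines):
    """Return (transitions, transversions) over the biallelic SNPs in vcf_lines."""
    ti = tv = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        ref, alt = fields[3].upper(), fields[4].upper()
        # Keep only single-base REF/ALT pairs: this drops indels and multi-allelic sites.
        if len(ref) != 1 or len(alt) != 1 or ref not in "ACGT" or alt not in "ACGT":
            continue
        if ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti, tv


def ti_tv_ratio(vcf_lines):
    """Return Ti/Tv as a float, or None when there are no transversions."""
    ti, tv = count_ti_tv(vcf_lines)
    return ti / tv if tv else None


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50\tPASS\t.",    # transition
        "chr1\t200\t.\tC\tT\t50\tPASS\t.",    # transition
        "chr1\t300\t.\tA\tC\t50\tPASS\t.",    # transversion
        "chr1\t400\t.\tAT\tA\t50\tPASS\t.",   # indel, skipped
        "chr1\t500\t.\tG\tA,T\t50\tPASS\t.",  # multi-allelic, skipped
    ]
    print(count_ti_tv(example), ti_tv_ratio(example))
```

For a real file you would pass an open file handle instead of the list, for example `with open("sample.vcf") as fh: print(ti_tv_ratio(fh))`, where `sample.vcf` is a placeholder name.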