importance of known sites/resources in GATK pipeline
9.2 years ago
Floydian_slip ▴ 170

Hi,

I have a general question about GATK concerning the importance of known-sites VCFs (for BQSR and HC) and resource files (for VQSR). I am working on rice, for which the only known sites are the dbSNP VCF files. These were built on an older genome build than the reference FASTA I am using.

How does this affect the quality/accuracy of the variants? How important is it to have the exact same build of the genome as the one on which the known VCF is based? Is it better to leave out the known sites for some of the steps than to use a version built on a different genome build of the same species? In other words, which steps (BQSR, HC, VQSR, etc.) can be performed without the known sites/resource file?

If the answers to these questions are too detailed for a post, could you please point me to any available document that addresses this issue?

Thanks,
Neil

Quality-Score-relcalibration GATK • 3.9k views
9.2 years ago

How does it affect the quality/accuracy of variants?

To answer that you need to understand how BQSR and VQSR work, so read these posts first (http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr, http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr).

How important is it to have the exact same build of the genome as the one on which the known VCF is based?

It is important that the known-variant data (e.g. dbSNP) come from the same genome build, unless the difference is a minor assembly revision that did not change the coordinates. If the coordinates of the same variant or gene differ between the two builds, you should not use the file as is. You can, however, lift it over to get the new coordinates.
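One quick way to see whether an old known-sites VCF still lines up with your reference is to check how many of its REF alleles are actually found at the stated positions. Here is a minimal Python sketch of that check. The function name `ref_concordance` and the in-memory `genome` dict are my own placeholders for illustration (with real data you would load the FASTA with a proper library), and the tiny example data is made up.

```python
def ref_concordance(vcf_lines, genome):
    """Count how many VCF REF alleles match the reference sequence.

    vcf_lines: iterable of VCF text lines (header lines are skipped)
    genome:    dict mapping contig name -> sequence string
               (hypothetical in-memory stand-in for the reference FASTA)
    Returns (matching, checked).
    """
    matching = checked = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref = line.rstrip("\n").split("\t")[:4]
        checked += 1
        seq = genome.get(chrom)
        if seq is None:
            continue  # contig missing from this build counts as a mismatch
        start = int(pos) - 1  # VCF POS is 1-based
        if seq[start:start + len(ref)].upper() == ref.upper():
            matching += 1
    return matching, checked


# Made-up example: two records agree with the reference, one does not.
genome = {"chr1": "ACGTACGTAC"}
lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t1\t.\tA\tG",
    "chr1\t4\t.\tT\tC",
    "chr1\t6\t.\tA\tT",
]
print(ref_concordance(lines, genome))  # (2, 3)
```

If a large share of the REF alleles do not match, the coordinates have shifted between builds and the file should be lifted over or left out.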

Is it better to leave out the known sites for some of the steps than to use the version which is built on a different version of the genome for the same species?

It is better to leave out these steps if you don't have dbSNP data for the same build. If you really want to try them, you can either a) use liftover to get the new positions, or b) call variants without these steps, manually select high-confidence variants (high MAPQ, a decent number of supporting reads, etc.), and then repeat BQSR/VQSR using that set of variants as the known sites.
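For option b), the selection of high-confidence variants can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names and the thresholds (`min_qual`, `min_mq`, `min_dp`) are arbitrary placeholders that you would tune for your data, and it assumes the caller wrote `MQ` and `DP` into the INFO column.

```python
def parse_info(info):
    """Turn a VCF INFO string like 'DP=30;MQ=60.0' into a dict of strings."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def select_confident(vcf_lines, min_qual=100.0, min_mq=50.0, min_dp=10):
    """Keep header lines plus records passing all three (assumed) thresholds."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        info = parse_info(fields[7])
        try:
            ok = (float(fields[5]) >= min_qual
                  and float(info["MQ"]) >= min_mq
                  and int(info["DP"]) >= min_dp)
        except (KeyError, ValueError):
            ok = False  # missing or non-numeric QUAL/MQ/DP: drop the record
        if ok:
            kept.append(line)
    return kept


# Made-up records: only the first passes every threshold.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t250.0\t.\tDP=30;MQ=60.0",
    "chr1\t200\t.\tC\tT\t20.0\t.\tDP=30;MQ=60.0",
    "chr1\t300\t.\tG\tA\t250.0\t.\tDP=3;MQ=60.0",
    "chr1\t400\t.\tT\tC\t250.0\t.\tDP=30",
]
confident = select_confident(records)
print(len(confident))  # 2
```

The resulting set could then be passed as the known sites for a second round of BQSR, and the procedure repeated if needed.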

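In other words, which steps (BQSR, HC, VQSR, etc.) can be performed without the known sites/resource file?

HC can be performed without the known variants; there, dbSNP is only used to annotate the called variants with known IDs. BQSR and VQSR, on the other hand, both need a set of known variants to work from.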

PS: I have never seen BQSR have a dramatic effect on variant calling. It is helpful, but it doesn't add much if you already have good NGS data to start with.
