Question: GATK4 Variant calling with non-human model and no known SNP database
Lidia (Germany) wrote, 13 days ago:

Hi everyone. I recently started working with DNA whole genome sequencing for variant calling with GATK 4.0.

I am working on a fish for which I don't have a database of known SNPs or indels. I have a total of 394 individuals, which means I have 394 WGS samples, and I would like to use the GVCF workflow.

According to what I have read, I need to create such lists (known SNPs and indels) from my own data. However, I have a couple of questions regarding the pipeline to achieve this.

1) To generate the list of SNPs and indels that will be provided as input for Base Quality Score Recalibration, should I run HaplotypeCaller in normal mode (which gives a .vcf file), or in GVCF mode for this first round (which gives a .g.vcf file)?

2) Since this first HaplotypeCaller round will be done per sample, I will end up with a total of 394 output files. Should I combine them all and keep only the high-quality variants, so that I end up with a single file of SNPs and a single file of indels to use for all 394 samples? Or should each sample be recalibrated with its own set of SNPs and indels?

Many thanks to all of you for your help and support.

Lidia

Ace wrote, 13 days ago:

GVCF mode in GATK is designed for calling variants jointly across a group of samples. In theory, you should get the same result from calling directly with HaplotypeCaller as from going through GVCF mode; the advantage of GVCF mode is that it saves time if you add samples later or want to genotype a different combination of samples.
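
As a rough illustration of the per-sample step, here is a minimal Python sketch that only builds (and prints) one HaplotypeCaller command per BAM file in GVCF mode. The file names (reference.fasta, fish_001.bam), the output directory, and the helper name are hypothetical placeholders; the sketch assumes the `gatk` wrapper is on your PATH and that each BAM is named after its sample.

```python
import shlex
from pathlib import Path


def haplotypecaller_gvcf_cmd(reference, bam, out_dir):
    """Build the argument list for one per-sample HaplotypeCaller run in GVCF mode.

    Hypothetical helper: it does not run GATK, it only assembles the command.
    """
    # Assumption: the BAM file is named <sample>.bam
    sample = Path(bam).stem
    gvcf = str(Path(out_dir) / f"{sample}.g.vcf.gz")
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,   # reference genome FASTA
        "-I", bam,         # one sample's aligned reads
        "-O", gvcf,        # per-sample g.vcf output
        "-ERC", "GVCF",    # emit reference confidence, i.e. GVCF mode
    ]


# Print one command per sample; placeholder file names only.
for bam in ["fish_001.bam", "fish_002.bam"]:
    print(shlex.join(haplotypecaller_gvcf_cmd("reference.fasta", bam, "gvcf")))
```

The printed lines can be pasted into a shell script or handed to a job scheduler, one job per sample.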

You want to make a g.vcf for each sample, then combine them, then genotype them. You can then use your highest-confidence variants in VQSR if you wish. If you have subsets of samples that you think may behave differently, it may be worth repeating the combine > genotype > select pipeline separately on some of those subsets and supplying the results as separate resources in VQSR, so that the algorithm can be trained to recognize the relevant patterns. Otherwise, I think you're fine using just the group calls.
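
The combine, genotype, and select steps described above could be sketched as follows. Again this is only an illustration that builds command lines without running them: the helper names, the "cohort" prefix, and all file names are made up for the example, and it assumes CombineGVCFs, GenotypeGVCFs, and SelectVariants from GATK4 are used for the three steps. How you then pick the "top" variants (which filters and thresholds) is left out, since that depends on your data.

```python
def combine_gvcfs_cmd(reference, gvcfs, out_gvcf):
    """Merge the per-sample g.vcf files into one cohort g.vcf."""
    cmd = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]  # one -V per input g.vcf
    return cmd + ["-O", out_gvcf]


def genotype_gvcfs_cmd(reference, combined_gvcf, out_vcf):
    """Joint-genotype the combined g.vcf into a regular multi-sample VCF."""
    return ["gatk", "GenotypeGVCFs", "-R", reference,
            "-V", combined_gvcf, "-O", out_vcf]


def select_variants_cmd(reference, vcf, variant_type, out_vcf):
    """Extract only SNPs or only indels from the joint-called VCF."""
    if variant_type not in ("SNP", "INDEL"):
        raise ValueError("variant_type must be 'SNP' or 'INDEL'")
    return ["gatk", "SelectVariants", "-R", reference, "-V", vcf,
            "--select-type-to-include", variant_type, "-O", out_vcf]


def joint_calling_plan(reference, gvcfs, prefix):
    """Return the commands in order: combine, genotype, select SNPs, select indels."""
    combined = f"{prefix}.g.vcf.gz"
    joint = f"{prefix}.vcf.gz"
    return [
        combine_gvcfs_cmd(reference, gvcfs, combined),
        genotype_gvcfs_cmd(reference, combined, joint),
        select_variants_cmd(reference, joint, "SNP", f"{prefix}.snps.vcf.gz"),
        select_variants_cmd(reference, joint, "INDEL", f"{prefix}.indels.vcf.gz"),
    ]


# Placeholder inputs: two per-sample g.vcf files from the previous step.
for cmd in joint_calling_plan("reference.fasta",
                              ["gvcf/fish_001.g.vcf.gz", "gvcf/fish_002.g.vcf.gz"],
                              "cohort"):
    print(" ".join(cmd))
```

This ends with one SNP file and one indel file for the whole cohort, which matches the single-set-for-all-samples option in the question.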

Lidia (Germany) wrote, 12 days ago:

Thank you very much for your answer. I will then continue with HaplotypeCaller in GVCF mode. Thanks!
