Hi all,
I think GATK is a great toolbox. There are quite a few steps involved, and I was wondering about the impact and importance of joint genotyping, particularly when working with very small sample sizes (around 10-15 samples). I read that it can lead to a loss of unique SNPs, which we would be particularly interested in. Does anyone have experience with this and can tell me about the effects on the outcome? In quantitative terms: do I gain many more overlapping SNPs, do I lose a lot of unique SNPs, or is the impact negligible?
Thanks for your input!
I don't believe that it's important, and I am unsure how it is any better than simply merging multiple VCFs together with BCFtools, provided, of course, that you have filtered your reads well prior to variant calling and set specific rules for the merge. With BCFtools, there are never issues with private ('unique') variants.
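As a rough sketch of what I mean (file names are placeholders, and you would adapt this to your own filtering rules), the merge could look like:

```bash
# Per-sample VCFs must be bgzipped and indexed before merging
bgzip sample1.vcf && bcftools index sample1.vcf.gz
bgzip sample2.vcf && bcftools index sample2.vcf.gz

# Merge into a multi-sample VCF; -0 reports genotypes missing from a
# sample as 0/0 (reference) instead of ./. (missing)
bcftools merge -0 -Oz -o merged.vcf.gz sample1.vcf.gz sample2.vcf.gz
bcftools index merged.vcf.gz
```

Private variants then simply show up as sites where only one sample carries a non-reference genotype.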
If you have 100 samples with a heterozygous SNP at 2x coverage, you are unlikely to call it in any of them individually. It's much more likely to be called if you call them together.
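For reference, this is roughly what the GVCF-based joint-genotyping workflow looks like in GATK4 (paths and sample names are placeholders):

```bash
# 1) Call each sample in GVCF mode, which records evidence at every
#    site rather than only at confidently called variants
gatk HaplotypeCaller -R ref.fasta -I sample1.bam -O sample1.g.vcf.gz -ERC GVCF
gatk HaplotypeCaller -R ref.fasta -I sample2.bam -O sample2.g.vcf.gz -ERC GVCF

# 2) Combine the per-sample GVCFs into a single cohort GVCF
gatk CombineGVCFs -R ref.fasta \
    -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
    -O cohort.g.vcf.gz

# 3) Joint genotyping: evidence is pooled across all samples, which is
#    what can rescue a weakly supported heterozygote in any one sample
gatk GenotypeGVCFs -R ref.fasta -V cohort.g.vcf.gz -O cohort.vcf.gz
```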
Why would you be calling variants at a read depth of 2? You'd be raising your chances of making false-positive calls elsewhere.
That was an extreme example to illustrate a point. On the other end of the spectrum, if you have sufficiently high coverage, all methods will converge towards the truth.
The problem is that you don't always know where you are on that spectrum, or you already have the data and have to work with what you have.
It's a common misconception that higher coverage alone will improve quality. Even at 1000x, we still miss many true variants and have to resort to running the sample multiple times just to find everything. High coverage brings a different set of problems than low coverage, but I'd obviously still prefer to have it.
That's certainly true. I get around this by generating multiple subsets of each aligned BAM and calling variants independently on each subset. At the end, I come up with a consensus list, roughly as sketched below.
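A minimal sketch of that approach (the subset fractions, seeds, and the choice of bcftools for calling are just for illustration):

```bash
# Three ~50% subsets of the same BAM, drawn with different seeds
# (samtools -s takes SEED.FRACTION: integer part = seed, decimals = fraction)
for seed in 1 2 3; do
    samtools view -b -s ${seed}.5 -o sub${seed}.bam aligned.bam
    samtools index sub${seed}.bam
    bcftools mpileup -f ref.fasta sub${seed}.bam \
        | bcftools call -mv -Oz -o sub${seed}.vcf.gz
    bcftools index sub${seed}.vcf.gz
done

# Consensus list: keep only sites called in all three subsets,
# writing the records from the first file (-w1)
bcftools isec -n=3 -w1 -Oz -o consensus.vcf.gz \
    sub1.vcf.gz sub2.vcf.gz sub3.vcf.gz
```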
I've worked on multiple clinical panels, so I've looked at the known variants many times over and never ran into that problem. Of course, everyone's experience will vary based on many factors. Coverage is not everything.
If you'd like a more official opinion, here is a citation based on the MSKCC panel:
Thanks for that reference!
Thanks for your input. May I ask how you would recommend filtering the reads? (I performed adapter and quality trimming before mapping and marked duplicates before variant calling.) May I also ask what specific merging rules you use? I think BCFtools would be a great option for me, since I am really mainly interested in the singletons.
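For the singletons, something like this is what I have in mind (assuming a merged, indexed multi-sample VCF; the allele-count thresholds are just the obvious first pass):

```bash
# Keep sites where the non-reference allele is observed exactly once
# across all samples (-c/-C are short for --min-ac/--max-ac)
bcftools view -c 1 -C 1 -Oz -o singletons.vcf.gz merged.vcf.gz
bcftools index singletons.vcf.gz
```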