Question: why is variant calling difficult?
gravatar for abc
23 months ago by
abc10 wrote:

I'm very new to this space, but from what I've read I'm not sure I understand what the process of variant calling looks like -- if we're given a reference genome, and then a specific genome of interest, what makes finding variants, which are just differences from the reference, so hard to identify?

For example, at a specific location the sequence, why is it wrong to simply take the majority of whatever bp is there and say "at this position, we are __% confident that individual X has this bp"? And since you would expect see heterozygosity at many loci, what does it even mean if you see "60% A and 40% T" at a given location?

It seems like alignment would be the actual tricky part of the process, not the 'calling' part.

I think i'm probably making some faulty assumptions or do not fully understand the problem, and I'd appreciate any explanation.

snp genome • 793 views
ADD COMMENTlink modified 23 months ago by igor12k • written 23 months ago by abc10
gravatar for Istvan Albert
23 months ago by
Istvan Albert ♦♦ 86k
University Park, USA
Istvan Albert ♦♦ 86k wrote:

Variant calling is difficult because the alignments are mathematical constructs that work by maximizing various predetermined rewards and penalities and by those imply that the simplest "explanation" is correct. Biology does not work this way. I have a chapter called Misleading alignments in the Biostar Handbook that I will summarize below:

Imagine that the sequence below is subjected to two insertions of Cs at the locations indicated with carets:

     ^       ^

The two sequences, when placed next to one another, would look like this:


A better way to visualize what’s happening in this example is shown in the expected alignment that would reflect the changes that we have introduced:

||||| |||||||| ||||||||||

Now suppose we did not know what the changes were. Can we discover the variation that we introduced by using a global aligner? Let’s see: CCAAACCCCCCCTCCCCCGCTTC CCAAACCCCCCCCTCCCCCCGCTTC

Here is what we obtain:

||||||||||||  .||||||||||

The variation indicated by the alignment is different: instead of the two insertions of C it shows one insertion of CT followed by a mismatch, adding insult to injury the variation is also shown to take place at a completely different location altogether.

You can see how difficult would be to call this correctly!

ADD COMMENTlink modified 23 months ago • written 23 months ago by Istvan Albert ♦♦ 86k

It's worth noting that this inherent representational ambiguity is also why _matching_ variants (for example looking up called variants in a variant database, intersecting two variant call sets to find common vs different variants, evaluating variant caller accuracy with respect to a gold standard benchmark dataset such as GIAB) is not as straight forward as people think. The most accurate variant comparisons require haplotype awareness (e.g. using a tool such as rtg vcfeval or

Even something as apparently simple as classifying a variant is a SNP vs Indel can be confounded by this, see the example under "Benchmarking performance for SNPs versus indels" in the vcfeval manual.

ADD REPLYlink written 23 months ago by Len Trigg1.5k
gravatar for Kevin Blighe
23 months ago by
Kevin Blighe69k
Republic of Ireland
Kevin Blighe69k wrote:

I feel that I answer some of these questions in my previous answer, here: A: DP in VCF files?

Others can make further comment specifically on the problems relating to alignment. However, a big challenge for aligners in my own mind is the high percent of the genome that exhibits sequence similarity. With short read NGS, it is virtually impossible to accurately align reads derived from certain regions. A few examples are VWF, which has a very large pseudogene on chr22, I believe, exons of some of the cytochrome (CYP) genes, and then genes that were born from duplication events throughout evolution.

The last PI to whom I mentioned this to correct him looked ashen / stone faced. In any case, no wonder that Illumina purchased PacBio, whose focus is on longer reads. Nanopore has also become popular (again focused on long reads).


ADD COMMENTlink modified 23 months ago • written 23 months ago by Kevin Blighe69k

the length of the read does not help in the case of misleading alignments - it is the "mathematics" that it is broken, or shall we say "overly simplistic" and not all that well suited to biology.

the alignments in question will be mathematically accurate - just biologically "wrong"

ADD REPLYlink modified 23 months ago • written 23 months ago by Istvan Albert ♦♦ 86k
gravatar for igor
23 months ago by
United States
igor12k wrote:

As others have already mentioned, alignment is part of the problem. There is additionally the inherent systematic error in the actual reads. Even if the alignment is completely accurate, the observed variants may not be real. Meacham et al. discussed that aspect and provide a nice illustration:

Types of errors

A screenshot from the IGV browser [21] showing three types of error in reads from an Illumina sequencing experiment: (1) A random error likely due to the fact that the position is close to the end of the read. (2) Random error likely due to sequence specific error- in this case a sequence of Cs are probably inducing errors at the end of the low complexity repeat. (3) Systematic error: although it is likely that the GGT sequence motif and the GGC motifs before it created phasing problems leading to the errors, the extent of error is not explained by a random error model.

ADD COMMENTlink modified 23 months ago • written 23 months ago by igor12k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2582 users visited in the last hour