why is variant calling difficult?
3
1
Entering edit mode
5.1 years ago
abc ▴ 10

I'm very new to this space, but from what I've read I'm not sure I understand what the process of variant calling looks like -- if we're given a reference genome, and then a specific genome of interest, what makes finding variants, which are just differences from the reference, so hard to identify?

For example, at a specific location the sequence, why is it wrong to simply take the majority of whatever bp is there and say "at this position, we are __% confident that individual X has this bp"? And since you would expect see heterozygosity at many loci, what does it even mean if you see "60% A and 40% T" at a given location?

It seems like alignment would be the actual tricky part of the process, not the 'calling' part.

I think i'm probably making some faulty assumptions or do not fully understand the problem, and I'd appreciate any explanation.

genome SNP • 2.1k views
ADD COMMENT
5
Entering edit mode
5.1 years ago

Variant calling is difficult because the alignments are mathematical constructs that work by maximizing various predetermined rewards and penalities and by those imply that the simplest "explanation" is correct. Biology does not work this way. I have a chapter called Misleading alignments in the Biostar Handbook that I will summarize below:

Imagine that the sequence below is subjected to two insertions of Cs at the locations indicated with carets:

CCAAACCCCCCCTCCCCCGCTTC
     ^       ^

The two sequences, when placed next to one another, would look like this:

CCAAACCCCCCCTCCCCCGCTTC
CCAAACCCCCCCCTCCCCCCGCTTC

A better way to visualize what’s happening in this example is shown in the expected alignment that would reflect the changes that we have introduced:

CCAAA-CCCCCCCT-CCCCCGCTTC
||||| |||||||| ||||||||||
CCAAACCCCCCCCTCCCCCCGCTTC

Now suppose we did not know what the changes were. Can we discover the variation that we introduced by using a global aligner? Let’s see:

global-align.sh CCAAACCCCCCCTCCCCCGCTTC CCAAACCCCCCCCTCCCCCCGCTTC

Here is what we obtain:

CCAAACCCCCCC--TCCCCCGCTTC
||||||||||||  .||||||||||
CCAAACCCCCCCCTCCCCCCGCTTC

The variation indicated by the alignment is different: instead of the two insertions of C it shows one insertion of CT followed by a mismatch, adding insult to injury the variation is also shown to take place at a completely different location altogether.

You can see how difficult would be to call this correctly!

ADD COMMENT
1
Entering edit mode

It's worth noting that this inherent representational ambiguity is also why _matching_ variants (for example looking up called variants in a variant database, intersecting two variant call sets to find common vs different variants, evaluating variant caller accuracy with respect to a gold standard benchmark dataset such as GIAB) is not as straight forward as people think. The most accurate variant comparisons require haplotype awareness (e.g. using a tool such as rtg vcfeval or hap.py).

Even something as apparently simple as classifying a variant is a SNP vs Indel can be confounded by this, see the example under "Benchmarking performance for SNPs versus indels" in the vcfeval manual.

ADD REPLY
3
Entering edit mode
5.1 years ago

I feel that I answer some of these questions in my previous answer, here: A: DP in VCF files?

Others can make further comment specifically on the problems relating to alignment. However, a big challenge for aligners in my own mind is the high percent of the genome that exhibits sequence similarity. With short read NGS, it is virtually impossible to accurately align reads derived from certain regions. A few examples are VWF, which has a very large pseudogene on chr22, I believe, exons of some of the cytochrome (CYP) genes, and then genes that were born from duplication events throughout evolution.

The last PI to whom I mentioned this to correct him looked ashen / stone faced. In any case, no wonder that Illumina purchased PacBio, whose focus is on longer reads. Nanopore has also become popular (again focused on long reads).

Kevin

ADD COMMENT
1
Entering edit mode

the length of the read does not help in the case of misleading alignments - it is the "mathematics" that it is broken, or shall we say "overly simplistic" and not all that well suited to biology.

the alignments in question will be mathematically accurate - just biologically "wrong"

ADD REPLY
1
Entering edit mode
5.1 years ago
igor 13k

As others have already mentioned, alignment is part of the problem. There is additionally the inherent systematic error in the actual reads. Even if the alignment is completely accurate, the observed variants may not be real. Meacham et al. discussed that aspect and provide a nice illustration:

Types of errors

A screenshot from the IGV browser [21] showing three types of error in reads from an Illumina sequencing experiment: (1) A random error likely due to the fact that the position is close to the end of the read. (2) Random error likely due to sequence specific error- in this case a sequence of Cs are probably inducing errors at the end of the low complexity repeat. (3) Systematic error: although it is likely that the GGT sequence motif and the GGC motifs before it created phasing problems leading to the errors, the extent of error is not explained by a random error model.

ADD COMMENT

Login before adding your answer.

Traffic: 1669 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6