Question

Variant Calling Heterozygous Reference Alleles

0

Entering edit mode

2.7 years ago

mti193 ▴ 10

I am going to be working with VCF files a lot in the near future so I thought I would brush up on the practice. After much reading and research, there's something that I just can't wrap my head around.

1) In a diploid organism, you have 2 alleles for a particular gene. My question is, how is this captured within the reference alignment sequence when the Reference alignment sequence is "single stranded" in that it fails to capture a possible heterozygous individual for a particular gene. For example, the reference genome at a particular locus will only have 1 nucleotide present.

In the following example:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 

20  10001019    .   T   G   364.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:18,15:33:99:393,0,480

It is deemed that the sample NA12878 is heterozygous for this position in that he/she has a T/G allele at each locus in the Chromosome 20. The question from above is referring to the reference. There is only 1 base. In actuality shouldn't there be maybe 2 alleles if the reference individual was heterozygous? If the reference individual also was heterozygous at this position and lets say he/she also had a G allele at the same locus, then shouldn't the VCF be reported as 0/0 and there would in fact maybe be no variants at all?

Flipping this around, lets say the NA12878 individual was used as a reference. At position 10001019 in Chromosome 20, which would be the REF? Would it be the T or the G allele since the person is heterozygous for both?

alignment vcf gatk variant SNP calling • 1.1k views

ADD COMMENT • link updated 2.7 years ago by LauferVA 4.2k • written 2.7 years ago by mti193 ▴ 10

score 1 · Answer 1 · 2021-08-15

I am going to write a somewhat indirect answer to your question - I am going to reframe it a bit for you. I hope that this will be helpful in the long run, even if I don't tackle your specific question.

The first thing I'd like to ask you to do is think about nucleotide variants that have differing allele frequency across human populations. For instance, let's say rs12345 has allele frequency 0.31 C 0.69T in Western European populations, but AF: C 0.82 and T 0.18 in Yorubans. Although you used the word reference, the first thing I want to do in this case is point out the meaninglessness of the terms "major" and "minor" allele.

Reference and alternate, as you have stated above, I think is a better nomenclature ... but here's the kicker... the more you think about individual instantiations of human genomes and all the diversity of our species, the less the concept of a "reference" makes means ... There are literally thousands of HLA class I alleles, and the purpose of this diversity is to ensure the survival of the species from myriad pathogenic threats. Why would we call one of these a reference?

Shortcomings of the idea of a reference genome:

The reference is missing hundreds of millions of bp of sequence. Our reference genome is based on only 2 Caucasian people. Because these people are of European descent and thus went through a population bottleneck, our reference genome necessarily does not reflect the totality of what is in the larger (human) population. If we still believe the oft quoted "statistic" that "we are 99.9% the same" it might make sense to use one, but recent results don't support that idea. Though there are some reasonable criticisms of this manuscript in Nature Genetics]1, I think nevertheless the authors make some fair points, including about how much information may be missing from our "reference".
The use of a reference genome that is strongly skewed towards one ancestry creates problems in variant calling algorithms. This is technical, but the gist is that the callers lend more credence to variants that are known to exist. This means if European genomes are the reference and the most data have been generated on individuals of European and East Asian descent, then we get better quality results when sequencing individuals more similar to those populations.
The need to have a reference genome persists due to bioinformatic necessity, not biologic utility What I mean is, as long as our sequencing technology is dominated by techniques that massively amplify short reads and then align them, then it will seem necessary to align them to something, and that might as well be called a reference. I am not trying to claim that there are no biologists who have benefited from using the "reference genome model", so to speak, it was an OK starting place. But it is a grossly limited concept that I think lends itself to inadequate appreciation of how variable genomes really are not to mention what the purpose of that diversity is... I'll try to explain in 4.
There are better ideas for how to represent genetic information out there already. There's a growing body of literature that uses Alignment-free read mapping. There are also emerging ways of representing the genome on the horizon; put for example "Genome Graphs." into Google or Pubmed. Social Justice and such aside (viz. point 2) I think if you look at images of genome graphs and reflect on the points above, you'll conclude as I have that reference genomes quite simply contain less information about biology.

Back to your question: Anyway, you didn't want to read any of this. You just wanted to know how to represent something in a darn v4.4 VCF file. I can respect that. But I took time out to write this because, in retrospect, it would have saved me time to explode the idea of a reference from the outset. It's a quaint, antiquated historical artifact that must not limit our view of what genetics is, and must not be allowed to negatively impact social justice in genomic research and clinical practice.

Good luck!!!! VL

score 0 · Answer 2 · 2021-08-15

0

Entering edit mode

2.7 years ago

sbstevenlee ▴ 480

Not sure if I understood your question 100%, but the important thing to keep in mind is: Yes, humans are diploid but the reference genome is haploid. Using your own example, the reference genome has T at chr20 10001019 while NA12878 has both T and G. Hope this helps.

ADD COMMENT • link 2.7 years ago by sbstevenlee ▴ 480