How are two alleles typically represented in a whole genome sequence?
2.2 years ago
adam • 0

I apologize in advance if this is a silly question - but I am trying to understand how two inherited variants of a gene are represented in typical whole genome sequencing formats (VCF, FASTA/Q).

Here is one example illustrating my confusion, using a SNP VCF from a WGS run. Take TAS2R38. GRCh37.p13 puts the gene's reference location at chr7:g.141672431 - chr7:g.141673573.

Using bcftools, if I call:

bcftools view genome.filtered.snp.vcf.gz 7:141673345-141673345
7   141673345   .   C   G   1434.3  PASS

How could I view the SNP, if any, from the other copy of the gene? And which allele am I viewing when I do the above?

genetics inheritance sequencing • 979 views

I am not sure I fully understand the question but are you asking how to understand the VCF format?

For this example gene, the VCF gives the reference base 'C' and the alternative 'G'. This describes your two alleles, which are the same except at this one position.

Typically VCFs are generated by comparison to a single reference genome and therefore all variants/alleles are based on comparisons to this genome, hence the REF version and the ALT which comes from your sample/data used for comparison.

Does that answer the question?


Thanks for your reply. Sort of. In this example, as you said, I have a nucleotide variant of G where the reference base shows C. However, as I understand it, this only shows a SNP in one of the two inherited copies of the gene in a diploid organism. What about the other copy? Is there any way to know whether the same SNP, or perhaps a different mutation, exists at the same location on the other copy of the gene? Do VCFs "collapse" the mutations from both copies of the gene into one? Hoping I am explaining this question more clearly.


Hi adam,

your VCF is missing some crucial information. What you have there is what you could call a "variant VCF": it describes which variant may exist at a given position. It does, however, lack the columns with the genotype.
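One way to check whether a VCF carries genotype columns is to look at its `#CHROM` header line: anything after the FORMAT column is a sample. The sketch below is only an illustration; the function name `sample_names` and the two header strings are made up for the example and are not taken from the poster's file.

```python
FIXED_COLUMNS = 9  # #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT


def sample_names(vcf_text):
    """Return the sample column names found on the #CHROM header line."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            # Everything past the nine fixed columns is one sample per column.
            return line.split("\t")[FIXED_COLUMNS:]
    raise ValueError("no #CHROM header line found")


# Hypothetical headers: one without genotypes, one with a single sample.
sites_only = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)
with_sample = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n"
)

print(sample_names(sites_only))   # no sample columns
print(sample_names(with_sample))  # one sample column
```

An empty list means there are no per-sample genotypes in the file, so the status of the two gene copies cannot be read from it.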

In a VCF with samples you would have a column per individual, in which the alleles are encoded. The reference allele (C in your example) is encoded with a 0, the alternative allele (G in your example) with a 1. A heterozygous individual (with one copy of the reference allele and one copy of the alternative allele) would be 0/1. An individual which has two copies of the alternative allele would have 1/1.
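As a rough illustration of that encoding, the following sketch turns the GT field of one record back into the bases on each copy. The record is the poster's line with a hypothetical FORMAT column and a hypothetical sample value (`0/1:30`) appended; the function name `decode_genotype` is likewise just for this example.

```python
def decode_genotype(vcf_line):
    """Return the base called on each chromosome copy for the first sample."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    # Allele index 0 is REF; indices 1, 2, ... are the comma-separated ALTs.
    alleles = [ref] + alt.split(",")
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    gt = sample_values[format_keys.index("GT")]
    # "/" separates unphased calls, "|" phased ones; treat both the same here.
    calls = gt.replace("|", "/").split("/")
    # "." marks a missing call.
    return [None if c == "." else alleles[int(c)] for c in calls]


# Hypothetical sample column added to the record from the question.
record = "7\t141673345\t.\tC\tG\t1434.3\tPASS\t.\tGT:DP\t0/1:30"
print(decode_genotype(record))
```

With `0/1` this gives one C and one G, i.e. a heterozygous individual; `1/1` would give G on both copies.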

Most often this genotype is based on counting the individual reads with either the C or the G allele. Commonly a genome is sequenced to 30x coverage, meaning that every base is observed about 30 times on average. For a heterozygous variant you would then expect to see one allele 15 times and the other 15 times, although there is obviously some (Poisson) variation on that and you will get slight deviations from the ideal 50-50 ratio.
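To get a feel for how much deviation from 50-50 is normal, here is a small sketch. It makes a simplifying assumption that is not in the post: the depth is fixed at exactly 30 and each read independently shows the alternative allele with probability 0.5, so the count follows a binomial distribution. In real data the depth itself also varies from site to site.

```python
from math import comb


def alt_read_probability(k, depth=30, p_alt=0.5):
    """Binomial probability of seeing exactly k ALT reads out of `depth`."""
    return comb(depth, k) * p_alt**k * (1 - p_alt) ** (depth - k)


# Chance of the "ideal" 15/15 split at a heterozygous site.
p_exact = alt_read_probability(15)

# Chance that the ALT count lands anywhere from 10 to 20 reads.
p_range = sum(alt_read_probability(k) for k in range(10, 21))

print(f"P(exactly 15 ALT reads) = {p_exact:.3f}")
print(f"P(10 to 20 ALT reads)   = {p_range:.3f}")
```

Under these assumptions an exact 15/15 split is the single most likely outcome but still uncommon, while nearly all heterozygous sites fall somewhere between 10 and 20 alternative reads.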

So yes, VCF "collapses" both copies of a gene/both alleles of a variant, but you can still figure out the status of both copies.

