Unreasonable allele frequency (AF) found in my sample VCF file
23 months ago
Bide • 0

Hi everyone. I have a VCF file containing variants from a single sampled genome. The AF values in the file are confusing me. To my knowledge, the AF value, the occurrence of the major allele, should be 1, 0.5, or 0, indicating homozygous, heterozygous, or homozygous (alt/alt) genotypes, since AF = allele occurrences / total alleles sampled (2 in this case, because the data come from only one individual). But why do I also have AF values such as 0.75 or 0.25? I thought it was due to the presence of multiallelic variants, yet some biallelic SNPs also have AF=0.25 or 0.75.

Could someone explain, please? Thank you in advance. :)

PGS • 757 views

Show us the VCF lines for such variants.
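
One way to pull out those lines is sketched below. It assumes that AF is stored as an `AF=` entry in the INFO column (column 8) of a tab-separated VCF; the thread does not say where the AF values come from, so adjust accordingly. The function name `lines_with_af` and the example records are hypothetical.

```python
def parse_info_af(info):
    """Return the AF values found in a VCF INFO string, or an empty list."""
    for entry in info.split(";"):
        if entry.startswith("AF="):
            values = []
            for raw in entry[3:].split(","):
                try:
                    values.append(float(raw))
                except ValueError:
                    pass  # skip missing values such as "."
            return values
    return []


def lines_with_af(vcf_lines, targets=(0.25, 0.75), tol=1e-6):
    """Return data lines whose INFO/AF contains any of the target values."""
    hits = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        afs = parse_info_af(fields[7])
        if any(abs(af - t) <= tol for af in afs for t in targets):
            hits.append(line.rstrip("\n"))
    return hits


# Made-up records, for illustration only (not from the poster's file).
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=8;AF=0.25\n",
    "chr1\t200\t.\tC\tT\t60\tPASS\tDP=30;AF=0.5\n",
    "chr1\t300\t.\tG\tA\t40\tPASS\tAF=0.75;DP=12\n",
]
for hit in lines_with_af(example):
    print(hit)
```

To run it on a real file, pass an open file handle, e.g. `lines_with_af(open("sample.vcf"))`, where `sample.vcf` is a placeholder name.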

23 months ago
d-cameron ★ 2.9k

But why do I have AF values such as 0.75 or 0.25 there as well? I thought it was due to the presence of multiallelic variations, yet some biallelic SNPs also have AF=0.25 or 0.75.

There are many possibilities. These include:

  • Your sample has a copy number of 4 at those positions. This is very common for somatic samples.

  • The reference genome has collapsed homologs. If there are actually two homologous genes that have been collapsed into one (intentionally or unintentionally), then your copy number will be higher.

    • Common for poorly assembled reference genomes.
  • Your reference genome doesn't include common duplications such as retrocopied pseudogenes. Variants in the gene or the homolog will show up as 0.25 or 0.75 for samples with the retrocopied pseudogene. There are around 50 common non-reference retrocopied pseudogenes that are not in the human reference genome.

  • Your sample has multiple segmental duplications and this region is within them. Less likely than the other possibilities (assuming non-cancer) as it requires two independent amplification events or an amplification that results in two additional copies.

  • The variants are within an STR or VNTR. These routinely have non-reference copy number.

  • You have low coverage, so your AF is noisy.

    • e.g. 4 reads, one of which carries the variant, gives an AF of 0.25.
  • Your caller is bad.

  • It's a tetraploid species.

    • Plant genomes are messy.
  • ...

  • ...
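
Two of the possibilities above (copy number of 4, and noisy AF at low coverage) can be put into numbers with a short sketch. It assumes the AF in the file is a read-based allele fraction (alt reads / total reads), as the 4-read example suggests, and it uses a plain binomial model with no sequencing error or mapping bias. The function names are hypothetical.

```python
from math import comb


def expected_fractions(copy_number):
    """Allele fractions possible when 0..copy_number copies carry the variant."""
    return [k / copy_number for k in range(copy_number + 1)]


def prob_alt_reads(depth, alt_reads, true_fraction=0.5):
    """Binomial probability of seeing exactly alt_reads variant reads at a given depth."""
    return (
        comb(depth, alt_reads)
        * true_fraction ** alt_reads
        * (1 - true_fraction) ** (depth - alt_reads)
    )


# Diploid site: only 0, 0.5 and 1 are possible.
print(expected_fractions(2))
# Copy number 4 (or a tetraploid): 0.25 and 0.75 become legitimate values.
print(expected_fractions(4))

# True heterozygous diploid site covered by 4 reads:
# chance of observing AF = 0.25 (1 alt read) or AF = 0.75 (3 alt reads).
print(prob_alt_reads(4, 1) + prob_alt_reads(4, 3))

# Same site at depth 40: chance of observing exactly AF = 0.25 (10 alt reads).
print(prob_alt_reads(40, 10))
```

Under this model, a truly heterozygous site covered by only 4 reads shows an AF of 0.25 or 0.75 about half the time, while at depth 40 an observed AF of exactly 0.25 is rare. Checking the depth at the affected sites is therefore a cheap first test before considering the copy-number explanations.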


Thank you very much! That explains a lot of things!
