GRCh37/38 reference genotype AF wrong?
7 months ago

Dear Colleagues,

I am new to variant calling and have started to analyse my VCF, generated from WES BAM files, to isolate clinically relevant germline variants. The VCF was generated using GRCh38 as the reference sequence. Now I have stumbled over the fact that a huge number of variants carry an obviously very low global AF.

For example, at https://www.ncbi.nlm.nih.gov/snp/rs2728532 the variant G>T seems to have a GAF of G=0.00242 and T=0.99758. Does this mean that 'T' is the correct global genotype and 'G' is a rare 'variation'?

Thank you in advance for your help!

GRCh38 reference-genome variant • 716 views
7 months ago
cmdcolin ★ 3.8k

The reference genome, e.g. GRCh37 or GRCh38, does not attempt to choose the "globally most common variant at every position". Put differently, it is not a "major allele reference genome". There are interesting consequences to this. You can find a "major allele reference genome" (see e.g. https://github.com/BenLangmead/bowtie-majref), and people are also looking at things like "population-specific major allele references" or just aligning to a pangenome graph.

It's just kinda the case that the reference genome is "some person that everyone compares to". Try explaining the reference genome to someone who isn't that well versed in bioinformatics; it's sometimes a funny exercise to justify it.

See also interesting blog posts like https://liorpachter.wordpress.com/2014/12/02/the-perfect-human-is-puerto-rican/ (which is not about major allele frequency, but about choosing all the "good alleles" from the SNPedia database).
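
To make the point concrete, here is a minimal Python sketch that checks whether the reference allele at a site is actually the minor allele. It assumes the allele frequencies are already available as a plain dict; the function names `major_allele` and `ref_is_minor` are hypothetical helpers made up for illustration, not part of any library. The numbers are the ones quoted in the question for rs2728532.

```python
def major_allele(freqs):
    """Return the allele with the highest frequency in a {allele: frequency} dict."""
    return max(freqs, key=freqs.get)


def ref_is_minor(ref, freqs):
    """True if the reference allele is NOT the most common allele at this site."""
    return major_allele(freqs) != ref


# Global allele frequencies quoted in the question for rs2728532 (G>T, so REF = G).
rs2728532 = {"G": 0.00242, "T": 0.99758}

print(major_allele(rs2728532))        # T
print(ref_is_minor("G", rs2728532))   # True
```

At such a site, most samples will show up as carrying the "variant" T simply because the reference happens to carry the rare allele G, not because anything is wrong with the VCF.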


Thank you very much for your explanation and the link to the majref sequence. From an evolutionary point of view, I would expect that a particular variant carried by 99.99% of humans should be the one that represents the near-optimum in terms of gene function (assuming that, on average, we are the crown of creation and do not need improvement ;)). Therefore, I thought that a "reference" should reflect the highest global genotype frequency ... but good to know that I was wrong.


"the near-optimum in terms of gene function"

How do you define optimum across ethnicities and geographies with their varied histories? Let's say the amount of oxygen needed by the body is optimized evolutionarily based on the altitude that a group of people have lived in; what would you consider a global near-optimum here? Your idea of a reference genome seems to come from a simplistic view of the world.


You are right, and I am aware of this fact, but I have to start from a reference that gives me an "oversimplified" gene structure. In the advanced stage, it could then be interesting to see what a variation does in a Sherpa population.


Why would you expect that? The vast majority of variants have no known effect on gene function and are not under selective pressure.
