Question: Check if REF allele is minor allele in any variant
6.3 years ago by Ram (32k)
Baylor College of Medicine, Houston, TX
Ram wrote:

From the discussions in previous questions, I understand that REF and ALT need not necessarily correspond to major and minor alleles. REF is from the ref genome and could very well be the minor allele for the variant.

I'd like to find out if a REF allele is a minor allele for any variant in my region of interest. One of the ways I could do this is to find out COUNT(variants) where af > 0.5 in my region of interest. 

Would I be correct in assuming this approach will definitely give me the right answer? Is there any underlying assumption I'm missing before I use this as my standard approach?

Any anomalies you might have noted in your experience would help me. Thank you!

6.3 years ago by Jorge Amigo (12k)
Santiago de Compostela, Spain
Jorge Amigo wrote:

You're right: the REF allele is just a convention for the allele that matches the reference genome. If for whatever reason you want to know in which variants the REF allele is the minor one, looking at AF should do. Since REF is a convention, there is no biological interest in finding that out, other than describing and characterizing the reference genome itself.

But keep in mind that there are cases where the AF calculation is not as straightforward as you may think. For instance, take a multiallelic variant where the minor allele is one of the alternative alleles and the major allele is another alternative allele: filtering on a frequency >0.5 for the first alternative allele would output the variant even though the REF allele isn't the minor allele. You must make sure that the AF accounts for all alternative alleles, which can be achieved with, for instance, bcftools view -q 0.5 file.vcf, since the default -q behaviour is to calculate the AF using all the non-reference alleles.
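To make the multiallelic case concrete, here is a minimal Python sketch that compares the first-ALT frequency, the summed non-reference frequency, and the REF frequency at a single site. The function name and the allele counts are hypothetical; the only assumption is that AN (total called alleles) and the per-ALT AC values are available, as they are in 1000 Genomes style VCFs.

```python
def ref_allele_summary(an, alt_counts):
    """Summarise REF vs ALT frequencies at one site.

    an: total number of called alleles (INFO/AN)
    alt_counts: allele count per ALT allele (INFO/AC), in VCF order
    """
    if an <= 0 or not alt_counts:
        raise ValueError("need a positive AN and at least one ALT count")
    ref_count = an - sum(alt_counts)
    if ref_count < 0 or any(c < 0 for c in alt_counts):
        raise ValueError("allele counts are inconsistent with AN")
    return {
        "ref_af": ref_count / an,
        "alt1_af": alt_counts[0] / an,       # first ALT allele only
        "nref_af": sum(alt_counts) / an,     # all non-reference alleles
        # Counts are compared directly to avoid float issues; ties count as true.
        "ref_is_least_frequent": all(ref_count <= c for c in alt_counts),
        "ref_is_most_frequent": all(ref_count >= c for c in alt_counts),
    }


# Hypothetical counts, not taken from any real call set.
for label, an, acs in [
    ("biallelic", 100, [70]),
    ("ALT1 major, ALT2 minor", 100, [60, 10]),
    ("REF rarest, no single ALT above 0.5", 100, [40, 35]),
]:
    print(label, ref_allele_summary(an, acs))
```

In the second case the first-ALT filter fires although REF is not the rarest allele; in the third case it misses a site where REF is the rarest allele, while the summed non-reference frequency catches it. Note that a summed frequency above 0.5 only says REF is carried by fewer than half of the alleles; at a multiallelic site, comparing REF against each ALT individually is the stricter check.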


Thank you, Jorge. Your reply on a different post was one of my references for the REF/ALT definition. We have a local DB created from reformatted/processed 1000genomes. I'll check on how the DBA dealt with multi-allelic variants.

If they haven't dealt with it the right way, I can always use your bcftools command with the raw VCF. Thank you :-)

Reply written 6.3 years ago by Ram (32k)

Cases where filtering by AF>0.5 fails because AF is calculated only from the first alternative allele are rare, but it is still more appropriate to take them into account. Also, keep in mind that filtering the raw 1000genomes data by AF will return indels as well, which I'm not sure would help you achieve your goal.

Reply written 6.3 years ago by Jorge Amigo (12k)

I ran a bunch of queries on my DB. There were no multi-allelic variants of any kind, and for all variants with a single REF and a single ALT allele, I found no case where af was >= 0.5. I guess I can safely assume that all ALT alleles are minor alleles in my sample space.

Reply written 6.3 years ago by Ram (32k)

There are indeed multi-allelic variants in the latest 1000genomes release (earlier releases collapsed them to bi-allelics), as stated in the callset readme file, and plenty of variants with AF > 0.5 too. If you don't find any yourself, then it depends on the way you've built your database, or on the region or the samples you are considering.

Reply written 6.3 years ago by Jorge Amigo (12k)

It is the region, I am quite certain. We store multi-allelic variants as multiple records, one record per ALT allele in an SQL database.

Reply written 6.3 years ago by Ram (32k)
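When multi-allelic variants are stored as one record per ALT allele, as described above, a per-record af > 0.5 check behaves like a single-ALT filter. A rough sketch of summing the per-ALT frequencies per site instead, using Python's sqlite3 with an in-memory database. The table name, column names and rows are hypothetical, and the sketch assumes every af at a site is computed over the same total allele number, so that the values can be added.

```python
import sqlite3


def sites_with_nref_af_above(conn, threshold=0.5):
    """Return (chrom, pos, ref, summed ALT AF) for sites above the threshold."""
    query = """
        SELECT chrom, pos, ref, SUM(af) AS nref_af
        FROM variants
        GROUP BY chrom, pos, ref
        HAVING SUM(af) > ?
        ORDER BY chrom, pos
    """
    return conn.execute(query, (threshold,)).fetchall()


# Hypothetical schema and rows: one record per ALT allele.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, af REAL)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [
        ("1", 100, "A", "G", 0.40),  # split multi-allelic site, first ALT
        ("1", 100, "A", "T", 0.35),  # split multi-allelic site, second ALT
        ("1", 200, "C", "T", 0.20),  # ordinary bi-allelic site
    ],
)

# Per-record check: how many single records have af > 0.5?
per_record = conn.execute(
    "SELECT COUNT(*) FROM variants WHERE af > 0.5"
).fetchone()[0]
# Per-site check: which sites have a summed ALT frequency > 0.5?
per_site = sites_with_nref_af_above(conn)
print(per_record, per_site)
```

Grouping on chrom, pos and ref is a simplification that fits SNVs; records for indels at the same position would need more care.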

On the indels, I'm targeting only SNVs anyway.

Reply written 6.3 years ago by Ram (32k)
2.5 years ago by Shicheng Guo (8.6k)
Shicheng Guo wrote:

In the phase 3 data set from the 1000 Genomes Project, only 2149549/84802133 = 2.5% of variants have an alternative allele frequency above 50%. This makes sense: the human reference genome is derived from several individual human genomes, so the reference allele should have a high probability of being the major allele.
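A count of this kind could be reproduced along the following lines. This is only a sketch: it sums the comma-separated INFO/AF values per record, which may or may not be how the figure above was obtained, and the function names and example lines are hypothetical.

```python
def nref_af_from_info(info):
    """Sum the comma-separated INFO/AF values; return None if AF is absent."""
    for field in info.split(";"):
        if field.startswith("AF="):
            return sum(float(x) for x in field[3:].split(","))
    return None


def count_above(vcf_lines, threshold=0.5):
    """Return (n_above, n_total) over data lines that carry an AF tag."""
    n_above = n_total = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
        af = nref_af_from_info(info)
        if af is None:
            continue
        n_total += 1
        if af > threshold:
            n_above += 1
    return n_above, n_total


# Hypothetical records, not real 1000 Genomes data.
lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG,T\t.\tPASS\tAC=40,35;AN=100;AF=0.4,0.35",
    "1\t200\t.\tC\tT\t.\tPASS\tAC=20;AN=100;AF=0.2",
    "1\t300\t.\tG\tA\t.\tPASS\tAC=90;AN=100;AF=0.9",
]
above, total = count_above(lines)
print(above, total)

# The ratio quoted in the post, as a percentage.
quoted_percent = 2149549 / 84802133 * 100
print(round(quoted_percent, 1))
```

For a real file the lines would come from an opened (possibly gzip-compressed) VCF rather than a list.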
