Alternate nucleotide is more frequent than reference nucleotide. OMG I'm dizzy. How do I stop the twirl?
3.4 years ago
Farrel ▴ 220
1. How did it come to be that the alternate nucleotide was more frequent than the reference nucleotide?
2. How does one account for this phenomenon when designing a strategy to filter for variants of interest? Should I go through the complicated process of selecting those individuals who DO NOT have the variant and calculate that the REFERENCE frequency in the population is probably around (1 - esp6500siv2_all)?

I am researching a rare disease and have whole exome sequence data with the corresponding variant calls. Each variant call has been passed to annovar and among other data, we have looked up the frequency of the variant in the esp6500siv2_all data. Clearly a variant that was observed to have a high frequency in our sample but that had low frequency in esp6500siv2_all would be of disproportionate interest.

Lo and behold, I was surprised to find that 13% of all of our variants (4055 out of 32131) had an allele frequency that was greater than 0.5. How can that be? I expected that all the allele frequencies would be <0.5.

I had thought that the variants would be akin to a minor allele frequency (MAF). Clearly I was wrong. I pulled 3 random variants from among the variants that had more than 0.5 frequency, to check them against the Exome Variant Server.

       avsnp147   Chr  Start      End        Ref  Alt  Gene.refGene  esp6500siv2_all
    1: rs3803530  15   89632842   89632842   C    A    KIF7          0.5373
    2: rs621383   3    125118840  125118840  T    C    SLC12A8       0.9988
    3: rs633561   11   64229857   64229857   A    G    NUDT22        0.9418


Looking these up on the NHLBI Exome Sequencing Project (ESP) Exome Variant Server, using the All Allele counts:

1. rs3803530: C>A; A=6984/C=6014 which means A is 6984/(6984+6014) or 0.537
2. rs621383: T>C; C=12479/T=15 which means C is 12479/(12479+15) or 0.999
3. rs633561: A>G; G=12240/A=756 which means G is 12240/(12240+756) or 0.942
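
The arithmetic above, and the (1 - frequency) idea from question 2, can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the counts are the ones quoted from the Exome Variant Server above.

```python
def alt_frequency(alt_count, ref_count):
    """Frequency of the alternate allele among all observed alleles."""
    return alt_count / (alt_count + ref_count)


def minor_allele_frequency(alt_freq):
    """Fold an alternate-allele frequency so that it is never above 0.5."""
    return min(alt_freq, 1.0 - alt_freq)


# (rsID, alternate allele count, reference allele count) as quoted above
evs_counts = [
    ("rs3803530", 6984, 6014),
    ("rs621383", 12479, 15),
    ("rs633561", 12240, 756),
]

for rsid, alt_count, ref_count in evs_counts:
    af = alt_frequency(alt_count, ref_count)
    maf = minor_allele_frequency(af)
    # Flag sites where the reference allele is the less common one
    note = "reference allele is the minor allele" if af > 0.5 else ""
    print(f"{rsid}\talt_freq={af:.4f}\tmaf={maf:.4f}\t{note}")
```

Whether folding to a minor allele frequency is the right filter depends on the question being asked. For rs621383, for instance, it is the reference allele T that is rare (about 0.0012), so a sample that matches the reference there is the unusual one.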

1) How did it come to be that the alternate nucleotide was more frequent than the reference nucleotide?

The reference genome carries the rare allele.


OK, and the reference genome would be just one person's at any one spot? The entire genome could be made up of many people's genomes, but at any one locus it would be just one person's sequence? So would I be correct that all the SNPs mentioned in the NHLBI Exome Sequencing Project (ESP) would be from an individual who is homozygous at that point? If they were heterozygous there could be no ref vs alt.


Hey Farrel, hg38 was released in the wake of the 1000 Genomes Project, where whole genome sequence data from ~2500 individuals became available. This information was in part used to construct hg38 where, for example, many of the rare disease risk alleles of hg19 were modified to represent more common alleles. hg38 also improves on sequence in centromeric and other repeat regions, which were previously difficult to sequence. So, yes, it's still a linear representation of the genome and has its flaws.

I'm not sure that I understand your point about the ESP. The ESP is a disease association study and includes heterozygous and homozygous variants.

The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.

ESP allele frequencies, like all others, are simply counted: 1 for each het and 2 for each hom. If my variant is observed as het in 1 individual in my cohort of 500 patients, then the allele frequency is 1 / (500 * 2) = 0.001, i.e. 0.1%.
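
As a rough sketch of that counting rule (assuming a diploid, autosomal site; the function and parameter names are just for illustration):

```python
def allele_frequency(n_het, n_hom_alt, n_individuals):
    """Alternate-allele frequency in a diploid cohort.

    Each heterozygote contributes 1 alternate allele, each alternate
    homozygote contributes 2, and every individual contributes 2 alleles
    to the denominator.
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    return (n_het + 2 * n_hom_alt) / (2 * n_individuals)


# One heterozygous carrier in a cohort of 500 patients
af = allele_frequency(n_het=1, n_hom_alt=0, n_individuals=500)
print(f"{af} ({af:.1%})")  # prints "0.001 (0.1%)"
```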

3.4 years ago

This is due to the fact that the very reference genomes that we use for re-alignment are themselves based on individuals who carry rare risk alleles. Thus, when we call variants against these genomes, we are, at many loci, comparing against rare disease risk alleles.

As the best/worst example (depending on your point of view), hg19 / GRCh37 was used for more than a decade as the primary reference genome, yet ~70% of the genomic sequence of this genome was based on a single individual from the Buffalo area, New York, USA. Amongst the many thousands of rare disease susceptibility alleles that this individual carried was one called Factor V Leiden, which statistically significantly increases the risk of deep vein thrombosis (DVT). If you're researching DVT (I was), you have to be aware of this.

Thus, if I perform exome-seq on an individual who does not have Factor V Leiden and re-align the data to hg19 / GRCh37, the Factor V Leiden variant position will show a SNV because the reference allele in my patient sample (which doesn't increase risk of DVT) is being compared against the disease allele that's contained in the very reference genome against which I'm re-aligning my data. Without careful screening, I may assume that my patient has increased risk of DVT, erroneously so.
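
One way to keep this straight when reading genotype calls is to count copies of the risk allele rather than copies of the alternate allele. A minimal sketch, assuming diploid VCF-style genotype strings (`0` = reference, `1` = alternate) and a hypothetical `risk_is_ref` flag that you would have to set yourself for each site on your panel:

```python
import re


def risk_allele_copies(genotype, risk_is_ref):
    """Count risk-allele copies in a diploid VCF-style genotype such as '0/1'.

    genotype    -- e.g. '0/0', '0/1', '1|1' (0 = reference, 1 = alternate)
    risk_is_ref -- True when the reference genome carries the risk allele
                   (hypothetical per-site flag, supplied by the user)
    """
    alleles = re.split(r"[/|]", genotype)
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        raise ValueError(f"unsupported genotype: {genotype!r}")
    risk_code = "0" if risk_is_ref else "1"
    return sum(a == risk_code for a in alleles)


# At a site where the reference carries the risk allele:
print(risk_allele_copies("1/1", risk_is_ref=True))  # hom-alt call, 0 risk copies
print(risk_allele_copies("0/0", risk_is_ref=True))  # no variant called, 2 risk copies
```

The second case is the easy one to miss: a sample that matches the reference produces no variant call at all, so a variant-only file will not show that it carries two copies of the risk allele.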

There was a publication on this listed in PubMed but it's very difficult to find, even by Google. It's a critical problem yet has not received the attention that it deserves.

The situation improved with hg38 / GRCh38, as this reference build was based on many more individuals, but the same problems still persist, broadly speaking.

So, you really have to get to know your target panel and all of the nuances related to whatever variants you're studying, particularly if you're dealing with live patient data.

Kevin

-----------------------------------------

Update 3rd January 2018

It has come to my attention that there is an automated method to search for these types of variants in your VCF:

3.4 years ago

reference ≠ major ≠ ancestral ≠ wildtype

As Kevin says, the reference is just whatever is in the reference sequence, which is the sequence of whoever they happened to sequence for that region.

GRCh38 is an improvement compared to GRCh37 because the GRC sought out some loci where the reference allele was not the major allele in the 1000 Genomes project, and replaced those regions with tiny contigs which did have the major allele. Some, not all.


Hi Emily, how should I understand ancestral ≠ wildtype? Thanks.


The ancestral allele is identified by tracing back up the evolutionary tree to see what other primates have at the same location. A mutation way back in our lineage at a particular locus may be one of the small evolutionary changes that make us human. Individuals who have the ancestral allele may have a phenotype that makes them more like our ancestors (maybe long arms, a heavy brow or slight mental retardation), in which case we could say that the ancestral allele is associated with the phenotype.

Generally I don't like the word "wildtype" at all because it implies that one allele confers a phenotype and the other does not, when in fact both alleles confer phenotypes; it just depends on your perspective as to which one is "normal". When we're talking about human phenotypes, that perspective is often drawn along racial lines, which is not acceptable. Take rs4988235, for example. The reference allele is G, which happens to also be the ancestral allele, found in all our primate relatives. As a European, I consider the reference allele G to be the one associated with a phenotype: lactose intolerance. However, a non-European might say that the phenotype association is with the alternative allele, A: lactase persistence. I would, therefore, be uncomfortable with assigning either of those alleles the term "wildtype".