Question: Strange reference allele convention
Asked 4 months ago by a user in the United Kingdom:

Hey all,

I downloaded the SNPs accompanying the paper:

"Genome-wide identification of splicing QTLs in the human brain and their enrichment among schizophrenia-associated loci" by Atsushi Takata, Naomichi Matsumoto & Tadafumi Kato

but found that a small number of the 8966 SNPs do not line up with the reference genome. An example is:

snp    pos       ref  alt
chr11  66329732  G    A/G

whereas I make this reference base to be A on hg19.

I am not sure what is going on here, and was wondering whether I am missing a convention in how these SNPs are reported, so that I can get the correct alignment. Is it that 'ref' is not actually the reference base, but the allele with the higher frequency? Since this isn't stated in the paper, it is quite tricky to ensure the SNPs are correctly aligned to the reference genome.


Odds are they're using their own "convention", labelling the columns in a way that does not match the usual meaning of ref and alt. Sigh, when will people ever stop doing this?

EDIT: You should email the authors. The ref allele is A and the variant is A>G. There is no rationale behind why they'd call the reference G.
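Rather than spotting such rows one at a time, one could compare the table's 'ref' column against the hg19 base at every position. Below is a minimal sketch, assuming the hg19 bases have already been looked up (for example from a FASTA file) and stored in a dictionary. The function name, the dictionary, and the category labels are hypothetical and not taken from the paper's supplementary files.

```python
def classify_row(chrom, pos, ref, alt, genome_bases):
    """Compare a reported 'ref' allele with the reference-genome base.

    genome_bases: dict mapping (chrom, pos) -> base in the reference
    genome (assumed to be pre-fetched, e.g. from an hg19 FASTA).
    alt may list several alleles separated by '/', as in 'A/G'.
    """
    genome_base = genome_bases.get((chrom, pos))
    if genome_base is None:
        return "unknown"  # position was not looked up
    genome_base = genome_base.upper()
    ref = ref.upper()
    alts = {a.upper() for a in alt.split("/")}
    if ref == genome_base:
        return "ref_matches_genome"
    if genome_base in alts:
        # 'ref' column holds something else; genome base is among 'alt'
        return "genome_base_listed_as_alt"
    return "no_allele_matches_genome"


# Row from the question; A is the hg19 base reported by the poster.
genome_bases = {("chr11", 66329732): "A"}
print(classify_row("chr11", 66329732, "G", "A/G", genome_bases))
```

For the example row this prints `genome_base_listed_as_alt`. A row that falls into `no_allele_matches_genome` would need a separate look, since it could point to a strand flip or a different genome build rather than a labelling convention.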

Reply modified 4 months ago • written 4 months ago by RamRS
4 months ago, Kevin Blighe (Republic of Ireland) wrote:


Please take a look at my answer (and that of Emily) here: "A: Alternate nucleotide is more frequent than reference nucleotide. OMG I'm dizzy."

The 'reference' base is strictly the base that appears in the reference genome, which may or may not have an association to a particular disease / phenotype and which may or may not be the allele with the highest allele frequency in the population of interest. Indeed, hg19 / GRCh37 contains many thousands of bases that are, in fact, the minor allele.

Regarding the example that you picked out, the rs ID is rs540874. While A is indeed the base at this position in the reference genomes from the GRC, G is actually the major allele across all populations.
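The distinction between "reference base" and "major allele" can be made explicit in a few lines. This is only a sketch: the function names are hypothetical, and the frequencies are made-up placeholders chosen so that G is the major allele. They are not real population data for rs540874.

```python
def major_allele(allele_freqs):
    """Return the allele with the highest frequency.

    allele_freqs: dict mapping allele -> frequency in some population.
    On a tie, the allele that appears first in the dict is returned.
    """
    return max(allele_freqs, key=allele_freqs.get)


def reference_is_minor(reference_base, allele_freqs):
    """True when the reference-genome base is not the major allele."""
    return major_allele(allele_freqs) != reference_base


# Placeholder frequencies for illustration only (NOT real data).
freqs = {"A": 0.3, "G": 0.7}
print(major_allele(freqs))             # major allele under these placeholders
print(reference_is_minor("A", freqs))  # is the reference base the minor allele?
```

With these placeholder values the major allele is G while the reference base is A, which is the situation described above: a table that puts the major allele in its 'ref' column will disagree with the genome at such sites.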

This confusion has undoubtedly resulted in many errors in published literature and also in clinical reports from both hospitals and private companies, but what can one do? It means that, unless one is aware of each and every case like this, then it may appear that a patient has a disease phenotype when, in fact, it is the dude who supplied his DNA to create hg19 who has the disease phenotype. hg38 has the same issues.

I do not believe the GRC ever strictly and clearly stated the limitations of using their reference genomes in re-sequencing studies. There was a publication on this years ago, but it is surprisingly (or unsurprisingly?) difficult to find. Something of this level of importance should have been published in all of the top journals.



Kevin, IMO when someone uses the terms ref and alt, they should have the reference and alternate alleles in the columns. The authors are the ones responsible for this confusion, and they are calling G the ref allele. Seriously, scientists, get your heads straight. Don't call the major allele the ref allele.

Reply written 4 months ago by RamRS

Right you are, Ram!

Reply written 4 months ago by Kevin Blighe

Agree with you both, but I have accepted this answer as it confirmed my suspicion about what was going on.

Reply written 4 months ago