Strange reference allele convention
5.5 years ago

Hey all,

I downloaded the accompanying snps from the paper:

Genome-wide identification of splicing QTLs in the human brain and their enrichment among schizophrenia-associated loci. Atsushi Takata, Naomichi Matsumoto & Tadafumi Kato

but found some strange alignment to the reference genome for a small number of the 8966 SNPs. An example is:

snp    pos       ref  alt
chr11  66329732  G    A/G

whereas I make this reference base to be A on hg19.

I am not sure what is going on here, and was wondering if I am missing a convention in how these SNPs are reported, so that I can get the correct alignment. Is it that 'ref' is not actually the reference base, but the allele with the higher frequency? Given that this isn't reported in the paper, it is quite tricky to ensure the correct alignment of the SNPs to the reference genome.


Odds are they're using their own "convention", labeling the columns in a way that doesn't match the usual meaning of ref and alt. Sigh, when will people ever stop doing this?

EDIT: You should email the authors. The ref allele is A and the variant is A>G. There is no rationale behind why they'd call the reference G.

5.5 years ago

Hey,

Please take a look at my answer (and that of Emily) here: "Alternate nucleotide is more frequent than reference nucleotide. OMG I'm dizzy."

The 'reference' base is strictly the base that appears in the reference genome, which may or may not have an association to a particular disease / phenotype and which may or may not be the allele with the highest allele frequency in the population of interest. Indeed, hg19 / GRCh37 contains many thousands of bases that are, in fact, the minor allele.

Relating to the example that you have picked out, the rs ID is rs540874. While A is indeed the base in the reference genomes from GRCh at this position, G is actually the major allele across all populations.
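For anyone who wants to check a whole SNP list rather than one position, here is a minimal Python sketch of the comparison described above. It assumes you have already looked up the reference-genome base at each position yourself (for example from an hg19 FASTA) and pass it in as `genome_base`; the function name `harmonize` and the status labels are made up for illustration and are not from the paper or any library. It only handles single-base alleles, and A/T or C/G SNPs remain strand-ambiguous, so treat the output as a flag for manual review rather than a definitive fix.

```python
# Complement table for single-base alleles (SNPs only, no indels).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def harmonize(reported_ref, reported_alt, genome_base):
    """Re-express a SNP relative to the reference-genome base.

    reported_ref : allele listed in the file's 'ref' column
    reported_alt : allele(s) in the 'alt' column, '/'-separated (e.g. 'A/G')
    genome_base  : base actually present in the reference genome
                   (assumed to be looked up separately by the caller)

    Returns (ref, alt, status), where status is one of the hypothetical
    labels 'ok', 'ref_swapped', 'strand_flipped' or 'mismatch'.
    """
    reported_ref = reported_ref.upper()
    genome_base = genome_base.upper()
    # Pool every allele mentioned in either column.
    alleles = {reported_ref} | set(reported_alt.upper().split("/"))

    if genome_base in alleles:
        # 'ok' if the file's ref really is the genome base, otherwise the
        # columns follow some other convention (e.g. major allele as 'ref').
        status = "ok" if genome_base == reported_ref else "ref_swapped"
    else:
        # Try the opposite strand before giving up.
        flipped = {COMPLEMENT[a] for a in alleles}
        if genome_base not in flipped:
            return None, None, "mismatch"
        alleles, status = flipped, "strand_flipped"

    alt = "/".join(sorted(alleles - {genome_base}))
    return genome_base, alt, status


# The example from the question: file says ref=G, alt=A/G; hg19 has A.
print(harmonize("G", "A/G", "A"))  # ('A', 'G', 'ref_swapped')
```

Counting how many of the 8966 SNPs come back as anything other than 'ok' should show quickly whether the file's 'ref' column follows the reference genome or some other convention.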

This confusion has undoubtedly resulted in many errors in published literature and also in clinical reports from both hospitals and private companies, but what can one do? It means that, unless one is aware of each and every case like this, then it may appear that a patient has a disease phenotype when, in fact, it is the dude who supplied his DNA to create hg19 who has the disease phenotype. hg38 has the same issues.

I do not believe the GRC ever clearly stated the limitations of using their reference genomes in re-sequencing studies. There was a publication on this years ago, but it is surprisingly (or unsurprisingly?) difficult to find. Something of this level of importance should have been published in all of the top journals.

Kevin


Kevin, IMO when someone uses the terms ref and alt, they should have the reference and alternate alleles in the columns. The authors are the ones responsible for this confusion, and they are calling G the ref allele. Seriously, scientists, get your heads straight. Don't call the major allele the ref allele.


Right you are, Ram!


Agree with you both, but I have accepted this answer as it confirmed my suspicion about what was going on.
