Inconsistencies between SNP coordinates and alleles across references
7.5 years ago
Max ▴ 140

While trying to identify the alleles at several SNP loci in a data set, I found a number of discrepancies in both the reference/variant alleles and the coordinates when comparing the UCSC reference genome, RefSNP on NCBI, and Ensembl's dbSNP data, even though they allegedly all use the same reference genome, hg19 or the equivalent GRCh37. In most cases, the nucleotide at the hg19 coordinates corresponding to the Ensembl SNP is correct (though ancestral/variant is sometimes reversed). However, if I enter the rs# into UCSC using hg19, it consistently gives a different coordinate for the SNP. Moreover, when I look up the same SNP on NCBI's RefSNP, the nucleotides are completely different, as though the complementary strand were being used. Since rs numbers do not depend on the assembly (unlike their coordinates), this shouldn't be the case. I have appended some examples below and would greatly appreciate a clarification:

rs1063192
ensembl grch37 coordinates = chr9:22003367
reference allele at these coordinates according to UCSC hg19 for chr9:22003367 = G
ensembl ancestral/variant = G/A
ucsc hg19 coordinates for rs1063192 = chr9:22003117-22003617
ncbi refsnp alleles C/T

rs601620
ensembl grch37 coordinates = chr20:62309839
reference allele according to UCSC hg19 for these coordinates = A
ensembl ancestral/variant = A/G
ucsc hg19 coordinates for rs601620 = chr20:62309589-62310089
ncbi refsnp alleles C/T

rs498872
ensembl grch37 coordinates: chr11:118477367
reference allele at these coordinates according to UCSC hg19 = A
ensembl ancestral/variant = G/A
ucsc hg19 coordinates for rs498872 = chr11:118477117-118477617
ncbi refsnp alleles C/T


For comparison, one that is (mostly) consistent:

rs10079250
ensembl grch37 coordinates = chr5:149450132
reference allele according to UCSC hg19 for chr5:149450132 = C
ensembl ancestral/variant = T/C
ucsc hg19 coordinates = chr5:149449882-149450382
ncbi refsnp alleles C/T

UCSC Ensembl SNP dbSNP • 2.6k views

Please note that the variants are not listed as ancestral/variant, but reference/alternative. The reference is just the base that was found in the individual from whom that contig of the genome was taken. It may be the minor allele, the non-ancestral allele or even the disease-causing allele, if that was the allele that individual had.

However, your final example, rs10079250, is a variant where dbSNP does not follow this convention: it lists the alternative allele first, followed by the reference. Note that in this case the T is in fact both the major allele and the ancestral allele.

7.5 years ago

The UCSC coordinates are for an interval (+/-250bp) flanking the SNP position.
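As a quick check of the point above, here is a minimal Python sketch that recovers the SNP position from the interval UCSC reports. It assumes the window is symmetric with 250 bp on each side, as described; the function name `snp_position_from_flank` is made up for illustration and is not part of any UCSC tool.

```python
def snp_position_from_flank(start, end, flank=250):
    """Return the SNP position at the centre of a symmetric flanking interval.

    Assumes the interval is start..end with `flank` bases on each side
    of the SNP (hypothetical helper, not a UCSC API).
    """
    position = start + flank
    # Sanity check: counting back from the end must give the same base.
    if end - flank != position:
        raise ValueError("interval is not symmetric around a single position")
    return position


# Intervals quoted in the question (hg19)
examples = {
    "rs1063192": (22003117, 22003617),
    "rs601620": (62309589, 62310089),
    "rs498872": (118477117, 118477617),
}

for rsid, (start, end) in examples.items():
    print(rsid, snp_position_from_flank(start, end))
```

For the three intervals above, the centre matches the Ensembl GRCh37 coordinate quoted in the question.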

In your first example (rs1063192), the locations are identical and Ensembl and NCBI SNPs are merely the reverse complement of each other (as are the flanking sequences, if you examine those entries). Ensembl has annotated the SNP relative to the gene CDKN2B, which is on the reverse strand, while NCBI annotates it relative to the genome (i.e., top strand).


Ensembl variants are always mapped to the positive strand. dbSNP maps to either strand, but this is usually indicated in the record. For your first example, Ensembl indicates the positive strand (as it always will), whereas the dbSNP record gives the strand as negative in this case.

This is because dbSNP is an archive to which people can submit their own variants in their preferred style. Submitters often identify alleles in the context of a gene (which is sometimes on the negative strand), so those are the alleles they submit to the database. Ensembl, in contrast, imports these data and can therefore convert them into a consistent, easy-to-follow style to prevent confusion like this.
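To see whether two allele pairs differ only by strand, something like the following Python sketch may help. It is an illustration only: `compare_alleles` is a hypothetical helper, it handles simple single-base alleles written as `X/Y`, and it ignores allele order (reference/alternative versus alternative/reference).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def compare_alleles(alleles_a, alleles_b):
    """Classify two 'X/Y' allele strings as same strand, opposite strand, or mismatch.

    Caveat: A/T and C/G SNPs are their own complement, so they always
    come back as "same strand" and the strand cannot be inferred this way.
    """
    set_a = set(alleles_a.upper().split("/"))
    set_b = set(alleles_b.upper().split("/"))
    if set_a == set_b:
        return "same strand"
    if {COMPLEMENT[base] for base in set_a} == set_b:
        return "opposite strand"
    return "mismatch"


# Ensembl alleles vs NCBI RefSNP alleles as quoted in the question
print(compare_alleles("G/A", "C/T"))  # rs1063192
print(compare_alleles("A/G", "C/T"))  # rs601620
print(compare_alleles("T/C", "C/T"))  # rs10079250
```

With the alleles from the question, the first three SNPs come out as strand flips, while rs10079250 differs only in the order in which the two alleles are listed.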


Emily_Ensembl is correct (thanks for catching my mistake). I inverted the description of Ensembl (which is genome-centric) and NCBI dbSNP (usually gene-centric) entries.


I see... I forgot that whether the site is read from the positive or negative strand depends on the orientation of the gene, and that SNPs are generally defined with respect to genes. This explains the inconsistency, thanks.


You may also be interested in this paper which discusses allele nomenclature.