Multiple Entries With The Same rs Number In dbSNP131
12.0 years ago
Biomed 4.8k

When I run the query

mysql> select * from snp131 where name='rs67465876';
+------+-------+------------+-----------+------------+-------+--------+---------+---------+----------+---------+--------+-------------------------------------+-------+---------+----------------+---------+--------+
| bin  | chrom | chromStart | chromEnd  | name       | score | strand | refNCBI | refUCSC | observed | molType | class  | valid                               | avHet | avHetSE | func           | locType | weight |
+------+-------+------------+-----------+------------+-------+--------+---------+---------+----------+---------+--------+-------------------------------------+-------+---------+----------------+---------+--------+
|  585 | chr1  |      13667 |     13668 | rs67465876 |     0 | +      | G       | G       | A/G      | genomic | single | by-cluster,by-hapmap,by-1000genomes |   0.5 |       0 | untranslated-3 | exact   |      3 |
| 1457 | chr2  |  114357349 | 114357350 | rs67465876 |     0 | -      | T       | T       | A/G      | genomic | single | by-cluster,by-hapmap,by-1000genomes |   0.5 |       0 | unknown        | exact   |      3 |
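| 1037 | chrY  |   59359060 |  59359061 | rs67465876 |     0 | -      | C       | C       | A/G      | genomic | single | by-cluster,by-hapmap,by-1000genomes |   0.5 |       0 | unknown        | exact   |      3 |
+------+-------+------------+-----------+------------+-------+--------+---------+---------+----------+---------+--------+-------------------------------------+-------+---------+----------------+---------+--------+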


As you can see, I get three different rows for the same rs number. Why are there three different SNPs on three different chromosomes with the same rs number? How should I treat these entries? Thanks

snp mapping dbsnp • 5.2k views
12.0 years ago

The problem stems from the SNP's flanking sequence aligning almost perfectly to multiple locations in the genome. If you enter the SNP sequence into BLAT (ttagtgcccgttggagaaaacgggaatccctaagaaatggtgggtcctggccatccgtgag) you get:

browser details YourSeq           59     1    61    61  98.4%     2   -  114357320 114357380     61
browser details YourSeq           58     1    61    61  98.4%    15   -  102517472 102517534     63
browser details YourSeq           58     1    61    61  98.4%    16   +      63317     63379     63
browser details YourSeq           58     1    61    61  98.4%     1   +      13636     13698     63
browser details YourSeq           56     1    61    61  96.8%     Y   -   59359031  59359093     63
browser details YourSeq           56     1    61    61  96.8%     X   -  155256025 155256087     63
browser details YourSeq           56     1    61    61  96.8%    12   -      91915     91977     63
browser details YourSeq           56     1    61    61  96.8%     9   +      13749     13811     63
browser details YourSeq           23    13    38    61  96.0%    13   +   48052536  48052566     31


which includes the chromosome 2, 1, and Y loci as well as several others. This suggests that the annotation confusion comes from the alignment. The first two entries are members of the same gene family (Ddx11l2, Ddx11l9), probably duplicates. I would guess that this SNP should be removed from your analysis, as it would be hard to say which location is providing the signal; probably the answer is "all of them".
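To make the ambiguity concrete, here is a minimal Python sketch that parses the BLAT hit lines above and keeps the hits covering the whole 61-base query. The column meanings (score, query start/end, query size, identity, chromosome, strand, target start/end, span) are my reading of the BLAT web output, and the function name `parse_blat_hits` is just made up for illustration.

```python
# Hit lines copied from the BLAT output quoted in this answer.
BLAT_OUTPUT = """\
browser details YourSeq           59     1    61    61  98.4%     2   -  114357320 114357380     61
browser details YourSeq           58     1    61    61  98.4%    15   -  102517472 102517534     63
browser details YourSeq           58     1    61    61  98.4%    16   +      63317     63379     63
browser details YourSeq           58     1    61    61  98.4%     1   +      13636     13698     63
browser details YourSeq           56     1    61    61  96.8%     Y   -   59359031  59359093     63
browser details YourSeq           56     1    61    61  96.8%     X   -  155256025 155256087     63
browser details YourSeq           56     1    61    61  96.8%    12   -      91915     91977     63
browser details YourSeq           56     1    61    61  96.8%     9   +      13749     13811     63
browser details YourSeq           23    13    38    61  96.0%    13   +   48052536  48052566     31
"""


def parse_blat_hits(text):
    """Parse 'browser details ...' lines into dicts (assumed column layout)."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Expect: browser details <query> + 10 value columns = 13 fields.
        if len(fields) != 13 or fields[:2] != ["browser", "details"]:
            continue
        hits.append({
            "score": int(fields[3]),
            "q_start": int(fields[4]),
            "q_end": int(fields[5]),
            "q_size": int(fields[6]),
            "identity": float(fields[7].rstrip("%")),
            "chrom": fields[8],
            "strand": fields[9],
            "start": int(fields[10]),
            "end": int(fields[11]),
        })
    return hits


hits = parse_blat_hits(BLAT_OUTPUT)

# Hits that align the whole query end to end; each is a candidate locus.
full_length = [h for h in hits if h["q_start"] == 1 and h["q_end"] == h["q_size"]]

for h in full_length:
    print(f"chr{h['chrom']}:{h['start']}-{h['end']} ({h['strand']}) "
          f"score={h['score']} identity={h['identity']}%")
```

With several full-length hits at nearly the same score, there is no single best placement to pick, which is why dropping the SNP seems the safer choice.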

12.0 years ago

The rs number assignment process relies heavily on clustering methods based on alignments, so all the usual alignment biases apply here. The official reason the dbSNP project gives for why one SNP entry may be assigned to several locations is found in dbSNP's manual on the mapping process:

SNPs can map to multiple places within a single chromosome if:

• The flanking sequence submitted with the SNP is too short.
• The SNP happens to map to a repetitive region of the chromosome.
• There happen to be variations within the SNP flanking sequence.
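One possible way to flag such multiply mapped rs numbers before an analysis is a `GROUP BY ... HAVING` query. The sketch below uses Python's built-in `sqlite3` with an in-memory table so it is self-contained; the table holds only four columns of the real `snp131` table, the first three rows come from the question, and the `rsHYPOTHETICAL1` row is invented purely to have a uniquely mapped entry for contrast. The same SQL should carry over to MySQL, but I have not run it against the full table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, illustrative version of snp131 (the real table has more columns).
conn.execute(
    "CREATE TABLE snp131 (chrom TEXT, chromStart INTEGER, chromEnd INTEGER, name TEXT)"
)
rows = [
    ("chr1", 13667, 13668, "rs67465876"),
    ("chr2", 114357349, 114357350, "rs67465876"),
    ("chrY", 59359060, 59359061, "rs67465876"),
    ("chr1", 100000, 100001, "rsHYPOTHETICAL1"),  # made-up row, not a real SNP
]
conn.executemany("INSERT INTO snp131 VALUES (?, ?, ?, ?)", rows)

# rs numbers that appear at more than one location.
multi_mapped = conn.execute(
    "SELECT name, COUNT(*) AS n_loci FROM snp131 "
    "GROUP BY name HAVING COUNT(*) > 1 ORDER BY name"
).fetchall()

# Rows whose rs number maps to exactly one location.
uniquely_mapped = conn.execute(
    "SELECT name, chrom, chromStart FROM snp131 "
    "WHERE name IN (SELECT name FROM snp131 GROUP BY name HAVING COUNT(*) = 1) "
    "ORDER BY name"
).fetchall()

print("multi-mapped:", multi_mapped)
print("uniquely mapped:", uniquely_mapped)
```

Whether to drop the multi-mapped entries or handle them some other way is a judgment call for the analysis at hand; the query only identifies them.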

One just needs to think about how dbSNP actually works in order to understand its limitations. dbSNP receives many batches of entries per day, where each entry is mainly a variation site surrounded by its genomic flanking sequence. Each entry is assigned an ss number after a quality check and is then mapped and clustered with other entries. Once everything is standardized, dbSNP can tell whether a submission repeats one already in the system; an rs number is assigned only if that ss number represents a new entry (if not, nothing has to change). There is a very clarifying figure on Wikipedia's dbSNP entry.
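As a rough illustration of the ss-to-rs idea only, here is a toy Python sketch. It is not dbSNP's actual pipeline: it assumes each submission has already been mapped to a single position, clusters purely on that position, and uses made-up ss and rs identifiers. The function name `cluster_submissions` is hypothetical.

```python
def cluster_submissions(submissions):
    """Toy clustering: submissions at the same locus share one rs ID.

    submissions: iterable of (ss_id, chrom, pos) tuples (all hypothetical).
    Returns a dict mapping each ss_id to its assigned rs ID.
    """
    rs_by_locus = {}   # (chrom, pos) -> rs ID already issued for that locus
    assignment = {}    # ss ID -> rs ID
    next_rs = 1
    for ss_id, chrom, pos in submissions:
        locus = (chrom, pos)
        if locus not in rs_by_locus:
            # New locus: issue a new rs ID.
            rs_by_locus[locus] = f"rs{next_rs}"
            next_rs += 1
        # Known locus: reuse the existing rs ID, nothing else changes.
        assignment[ss_id] = rs_by_locus[locus]
    return assignment


# Made-up submissions: ss1 and ss2 describe the same site.
example = [("ss1", "chr1", 100), ("ss2", "chr1", 100), ("ss3", "chr2", 500)]
assigned = cluster_submissions(example)
print(assigned)
```

In this toy version every submission has exactly one position. The real difficulty discussed in this thread arises when the flanking sequence aligns equally well to several positions, so one cluster ends up with several mapped locations.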