How to search dbSNP using a list of SNPs and retrieve Gene name (hgnc symbol if existing, otherwise just whatever is in there)
2
1
Entering edit mode
8 months ago
f-rasmussen ▴ 10

I have a list of 500.000 SNPs from which I want to obtain the gene name. I try to search with biomaRt

library(data.table)
library(biomaRt)

rs <- fread("SNPs.txt")
ensembl_version = "https://dec2016.archive.ensembl.org"
ensembl <- useMart("ENSEMBL_MART_SNP", dataset = "hsapiens_snp")


getBM(attributes=c("refsnp_id", "associated_gene"), filters="snp_filter", values=rs, mart=ensembl, uniqueRows=TRUE)

However many of the SNPs return NA or simply nothing. Show here:

      refsnp_id associated_gene
1      rs425277           PRKCZ
2     rs1571149                
3     rs1240707                
4     rs1240708                
5      rs873927                
6      rs880051           SSU72
7      rs904589                
8      rs908742                
9      rs909823                
10     rs925905                
11       rs7290                
12       rs7407                
13    rs1878745                
14    rs2296716           SSU72
15    rs2298217                
16    rs2459994

When I search some of the rsIDs which did not produce a gene name on dbSNP, they are in fact associated with a gene name in the database. My question is then, how can I connect biomaRt to dbSNP and retrieve the correct gene names for all the SNPs in the list 'SNPs.txt'?

biomaRt R • 1.3k views
ADD COMMENT
0
Entering edit mode
8 months ago
GenoMax 110k

Using EntrezDirect:

$ more id
rs425277
rs1571149
rs1240707
rs1240708
rs873927
rs880051
rs904589

$ for i in `cat ./id`; do esearch -db snp -query "${i} AND human [orgn]" | esummary | xtract -pattern DocumentSummary -element SNP_ID,NAME | uniq ; done
425277  PRKCZ
1571149 TMEM240
1240707 CCNL2   MRPL20-AS1
1240708 CCNL2   MRPL20-AS1
873927  LINC01770   LOC107985729
880051  SSU72
904589  ANKRD65 LOC105378585
ADD COMMENT
0
Entering edit mode
8 months ago

Via Python and the biothings_client library:

#!/usr/bin/env python

from biothings_client import get_client
import sys

variants = [
    'rs425277',
    'rs1571149',
    'rs1240707',
    'rs1240708',
    'rs873927',
    'rs880051',
    'rs1878745',
    'rs2296716',
    'rs2298217',
    'rs2459994'
]
vm = {x : None for x in variants}
vc = get_client('variant')
rs = vc.querymany(variants, scopes='dbsnp.rsid', fields='all', verbose=False)
for r in rs:
    q = r['query']
    if 'dbsnp' in r:
        d = r['dbsnp']
        if 'gene' in d:
            g = d['gene']
            if 'symbol' in g and not vm[q]:
                vm[q] = g['symbol']
for v in variants:
    sys.stdout.write('{}\t{}\n'.format(v, vm[v]))

Sample output:

$ ./get_hgnc_symbols_for_variants.py
rs425277    PRKCZ
rs1571149   TMEM240
rs1240707   None
rs1240708   None
rs873927    None
rs880051    SSU72
rs1878745   PRKCZ
rs2296716   SSU72
rs2298217   None
rs2459994   PRKCZ

There are lots of other attributes you could inspect and retrieve, by editing the for r in rs loop.

ADD COMMENT
0
Entering edit mode

Hi Alex! Thanks for this information. I have a list of SNPs in a .csv file along with other columns. Can you please tell me how I can supply those SNPs in the csv file in place of the following format?

variants = [ 'rs425277', 'rs1571149', 'rs1240707', 'rs1240708', 'rs873927', 'rs880051', 'rs1878745', 'rs2296716', 'rs2298217', 'rs2459994' ]

Thanks in advance

ADD REPLY
1
Entering edit mode

The exact code will depend on your CSV file:

import io
import pandas as pd

csv_as_string = '''a,b,c,d,e,f'''

variants = pd.read_csv(io.StringIO(csv_as_string), header=None, skipinitialspace=True)
# chop up variants, as needed
ADD REPLY
0
Entering edit mode

Thanks Alex

ADD REPLY

Login before adding your answer.

Traffic: 2338 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6