Automated dbSNP rsID position lookup

Question

Tutorial:Automated dbSNP lookup by rsID position, plus genome build liftover

6

Entering edit mode

20 months ago

Kevin Blighe 87k

Hola, just passing by to say 'hi'.

Please post bugs / suggestions as comments to this tutorial.

Automated dbSNP rsID position lookup

rsID to position

GRCh38

cat rsids.list
rs1296488112
rs1226262848
rs1225501837
rs1484860612
rs1235553513
rs1424506967

cat rsids.list | while read rsid ;
do
  pos=$(curl -sX GET "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=$rsid&retmode=text&rettype=text" | sed 's/<\//\n/g' | grep -o -P '\<CHRPOS\>.{0,15}' | cut -f2 -d">") ;
  echo -e "${pos}""\t""${rsid}" ;
done ;

1:13616 rs1296488112
1:13621 rs1226262848
1:13623 rs1225501837
1:13634 rs1484860612
1:13646 rs1235553513
1:13647 rs1424506967

rsID to position

GRCh37

This is the same as above, except that the tag returned that contains the GRCh37 co-ordinates is CHRPOS_PREV_ASSM, not CHRPOS:

cat rsids.list
rs1296488112
rs1226262848
rs1225501837
rs1484860612
rs1235553513
rs1424506967

cat rsids.list | while read rsid ;
do
  pos=$(curl -sX GET "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=$rsid&retmode=text&rettype=text" | sed 's/<\//\n/g' | grep -o -P '\<CHRPOS_PREV_ASSM\>.{0,15}' | cut -f2 -d">") ;
  echo -e "${pos}""\t""${rsid}" ;
done ;

1:13616 rs1296488112
1:13621 rs1226262848
1:13623 rs1225501837
1:13634 rs1484860612
1:13646 rs1235553513
1:13648 rs1424506967

Position to rsID

GRCH38

Here, we have to be wary that more than a single rsID can map to a single position.

cat grch38.bed
chr1    13616   13616
chr1    13621   13621
chr1    13623   13623
chr1    13634   13634
chr1    13646   13646
chr1    13647   13647

sed 's/^chr//g' grch38.bed | cut -f1,2 | while read chr pos ;
do
  rsid=$(curl -sX GET "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&term=$pos%5BPOSITION%5D+AND+$chr%5BCHR%5D" | sed 's/<\//\n/g' | grep -e "<Id>" | cut -f2 -d">" | awk '{print "rs"$0}' | tr '\n' ',' | sed 's/,$//g') ;
  echo -e "${chr}"":""${pos}""\t""${rsid}" ;
done ;

1:13616 rs1296488112
1:13621 rs1226262848
1:13623 rs1225501837
1:13634 rs1639595957,rs1484860612
1:13646 rs1235553513
1:13647 rs1424506967

GRCH37

cat grch37.bed
chr1    13616   13616
chr1    13621   13621
chr1    13623   13623
chr1    13634   13634
chr1    13646   13646
chr1    13648   13648

sed 's/^chr//g' grch38.bed | cut -f1,2 | while read chr pos ;
do
  rsid=$(curl -sX GET "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&term=$pos%5BPOSITION_GRCH37%5D+AND+$chr%5BCHR%5D" | sed 's/<\//\n/g' | grep -e "<Id>" | cut -f2 -d">" | awk '{print "rs"$0}' | tr '\n' ',' | sed 's/,$//g') ;
  echo -e "${chr}"":""${pos}""\t""${rsid}" ;
done ;

1:13616 rs1296488112
1:13621 rs1226262848
1:13623 rs1225501837
1:13634 rs1639595957,rs1484860612
1:13646 rs1235553513
1:13648 rs1424506967

.

Automated genome build lift-over

For automated lift-over in R via rtracklayer, please try this:

  # chain file for hg19 to hg38
    download.file('http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz', 'hg19ToHg38.over.chain.gz')
    R.utils::gunzip('hg19ToHg38.over.chain.gz')

  # chain file for hg38 to hg19
    download.file('http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz', 'hg38ToHg19.over.chain.gz')
    R.utils::gunzip('hg38ToHg19.over.chain.gz')

  # import variant positions
    positions <- read.table('grch37.bed', sep = '\t', header = FALSE, stringsAsFactors = FALSE)

  # perform liftover, assuming hg19 to hg38
    library(rtracklayer)
    grObject <- GRanges(
      seqnames = positions[,1],
      ranges = IRanges(start = positions[,2], end = positions[,2]))
    chainObject <- import.chain('hg19ToHg38.over.chain')
    as.data.frame(liftOver(grObject, chainObject))[,c('seqnames','start','end','width')]

  seqnames start   end width
1     chr1 13616 13616     1
2     chr1 13621 13621     1
3     chr1 13623 13623     1
4     chr1 13634 13634     1
5     chr1 13646 13646     1
6     chr1 13647 13647     1

Gracias,

Kevin

rsid hg19 liftover hg38 dbsnp • 2.5k views

ADD COMMENT • link updated 12 months ago by Randy H ▴ 110 • written 20 months ago by Kevin Blighe 87k

1

Entering edit mode

And for those looking for a python solution:

import requests
from bs4 import BeautifulSoup

url='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=rs1296488112&retmode=text&rettype=text'
htmlContent = requests.get(url, verify=True)
soup = BeautifulSoup(htmlContent.text)
chromo = soup,chr.string                           #  '1' in this example
list(soup.gene_e.children)[0].name                 #  'name' in this example which is a special method in BeautifulSoup so ....
gname = list(soup.gene_e.children)[0].string       #    simply assume it is the first element (or write code to check that it is before using)

When an rsID you are trying to lookup does not exist:

url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=rs000&retmode=text&rettype=text"        # Bad rsID
if soup.string == 'ID+list+is+empty!+Possibly+it+has+no+correct+IDs.\n' :                  # Returned string when rsID lookup failed
     ....

dbSNP Entrez does have a rate limit of 3 queries per ??? seconds, so:

import time
if soup.string == '{"error":"API rate limit exceeded","api-key":"104.185.79.219","count":"4","limit":"3"}\n' :
     # Requests coming in too fast; need to throttle
     time.sleep(5)                    # Sleep 5 seconds and try again
     htmlContent = requests.get(url, verify=True)
     soup = BeautifulSoup(htmlContent.text)
 # Or see a more proper solution using rate_limit : https://stackoverflow.com/a/50441345

To search by position, entrez does not seem to have a direct search link. So you have to do a URL based form search in their user web interface:

url='https://www.ncbi.nlm.nih.gov/snp/?term=1[CHR]+AND+13616[POSITION]+AND+%22homo+sapiens%22[Organism]'
#  Use POSITION_GRCH37 if not Build 38 (the default)
# The returned HTML code is a full browser (user) page. HTML decoding is too complicated to show succinctly here.
# But let me know and I can create a GIST to show how I extract the info; and test for when there is no rsID at the position
# Note: you can get more than one rsID returned per position

ADD REPLY • link 12 months ago by Randy H ▴ 110

0

Entering edit mode

Hi,

Thanks for this useful script!

It works as expected most of the time, but I've notice that sometimes it comes up empty, I guess this has to do with internet speed at the moment of query.

For example, I wanted to find the HG38 postion of a list of SNPs (rs IDs), which contained rs17728338. The position for this SNP came back empty in the list, but when I looked it up manually, it did find the position. Maybe there is a way to specify that it should re-try if it comes up empty at first?

Thanks, Annique

ADD REPLY • link 14 months ago by Annique • 0

0

Entering edit mode

Hey Annique, it could be that the server is becoming overloaded with requests, and blocks some of the requests. Can you add the sleep command to the while do loops: For example:

sed 's/^chr//g' grch38.bed | cut -f1,2 | while read chr pos ;
do
  rsid=$(curl -sX GET "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&term=$pos%5BPOSITION%5D+AND+$chr%5BCHR%5D" | sed 's/<\//\n/g' | grep -e "<Id>" | cut -f2 -d">" | awk '{print "rs"$0}' | tr '\n' ',' | sed 's/,$//g') ;
  echo -e "${chr}"":""${pos}""\t""${rsid}" ;
  sleep 5
done ;

ADD REPLY • link 14 months ago by Kevin Blighe 87k