Question

how to find genes based on SNP rsIDs

0

Entering edit mode

6.4 years ago

kharis.b9770 • 0

I have 8 SNPs meeting sigificance in a GWAS and need to find which genes they are associated with. I have searched by rsIDs using NCBI gene though can't find any association and it doesn't appear there are any associations in the literature. How can I look these up? I have searched them in the UCSC genome browser, though am not sure where to look to find associated genes. Please help

rsID GWAS SNP • 3.8k views

ADD COMMENT • link updated 6.4 years ago by Alex Reynolds 35k • written 6.4 years ago by kharis.b9770 • 0

0

Entering edit mode

Did you try searching on NCBI dbSNP?

ADD REPLY • link 6.4 years ago by Ram 43k

0

Entering edit mode

In addition to Ram's suggestion, if you know flanking sequences, just do NCBI BLAST search.

ADD REPLY • link 6.4 years ago by second_exon ▴ 210

score 0 · Answer 1 · 2017-12-15

If you want a fast (binary) search with a locally-stored dataset, where you know exactly what SNPs and gene annotations are being queried, here's a way to set up files to do your own querying, which scales up very well when you have a lot of SNPs.

1) Get annotations of interest and write them into a sorted BED file.

For example, here's a way to get Gencode v26 gene annotations:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.basic.annotation.gff3.gz \
    | gunzip -c \
    | convert2bed --input=gff - \
    > gencode.v26.bed

2) Get SNPs and write them into a text file sorted by SNP ID.

For hg38, for instance:

$ LC_ALL=C
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \
   | gunzip -c \
   | awk -v OFS="\t" '{ print $5,$2,$3,($3+1) }' \
   | sort -k1,1 \
   > hg38.snp147.sortedByName.txt

3) Install pts-line-bisect:

$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..

4) Run a query:

$ rs_of_interest=rs2814778
$ ./pts-line-bisect/pts_lbsearch -p hg38.snp147.sortedByName.txt ${rs_of_interest} \
   | head -1 \
   | cut -f2- \
   | bedmap --echo --echo-map-id-uniq --range 1000 - gencode.v26.bed \
   > answer.bed

The file answer.bed contains the genomic interval of the SNP of interest, along with the IDs of any Gencode v26 gene annotations within 1000 nt upstream and downstream of the SNP.

To repeat step 4 within a Python script, you can use subprocess.check_output():

...
pts_lbsearch = os.path.join(project_dir, "pts-line-bisect", "pts_lbsearch")
if not os.path.exists(pts_lbsearch):
  print "Content-type:application/json\r\n\r\n"
  print json.dumps({'error' : 'could not locate pts_lbsearch [%s]' % (pts_lbsearch)})
  sys.exit(errno.ENOENT)

if term.startswith('rs'):
  query_fn = os.path.join(annotations_dir, "hg38.snp147.sortedByName.txt")
  cmd = "%s -p %s %s" % (pts_lbsearch, query_fn, term)
  try:
    res = subprocess.check_output(cmd, shell=True)
    if res and len(res) > 0:
      (chr, start, stop) = res.strip().split('\n')[0].split('\t')
    else:
      pass
  except subprocess.CalledProcessError as err:
    sys.stderr.write("Query failed!\n")
    sys.exit(errno.EIO)

#
# map 'chr', 'start', and 'stop' variables to annotations with 
# another subprocess call (not shown -- use example above)
#
...