Question

Using the Ensembl REST API to fetch the SNPs for a list of genes

2.8 years ago

anasjamshed ▴ 140

I have a list of 422 genes and I want to fetch ONLY the following fields:

Gene
Variant ID
Location
vf_allele
Alleles
Clin. Sig.
Conseq. Type
cadd
revel_sort
meta_lr_sort
mutation_assessor_sort
publications

through the Ensembl REST API.

I am trying this code:

import requests, sys

server = "http://rest.ensembl.org"
ext = "/variant_recoder/homo_sapiens"
headers={ "Content-Type" : "application/json", "Accept" : "application/json"}
r = requests.post(server+ext, headers=headers, data='{ "ids" : ["rs56116432", "rs1042779" ] }')

if not r.ok:
  r.raise_for_status()
  sys.exit()

decoded = r.json()
print(repr(decoded))

This code asks for rs IDs, but I want to input a list of 422 genes. Is this possible?

ensembl • 3.3k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 2.8 years ago by anasjamshed ▴ 140

score 0 · Answer 1 · 2022-01-04

2.8 years ago

Ben Moore ★ 2.4k

Hi anasjamshed1994,

The variant recoder endpoint converts variants from one notation format to another (e.g. rs IDs -> HGVS format).

You can use the GET overlap/id endpoint to retrieve a list of variants that overlaps your genes of interest: http://rest.ensembl.org/documentation/info/overlap_id

ADD COMMENT • link 2.8 years ago by Ben Moore ★ 2.4k

Could you please explain more? How can I input 400 genes into overlap/id?

ADD REPLY • link 2.8 years ago by anasjamshed ▴ 140

The GET overlap endpoint allows only a single gene ID per query, so you will need to create a loop within your script to submit each gene ID separately.

ADD REPLY • link 2.8 years ago by Ben Moore ★ 2.4k

Can you please help me write the loop?

ADD REPLY • link 2.8 years ago by anasjamshed ▴ 140

This depends on the language you are using to query the REST API, but you will need to create a list of your gene IDs and then create a for loop that substitutes each gene ID into the URL.

e.g. in Python: https://www.w3schools.com/python/python_for_loops.asp
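
As a minimal sketch of such a loop in Python: the gene IDs below are placeholders, and `build_overlap_url` and `fetch_overlaps` are hypothetical helper names, not part of the Ensembl API. It builds one overlap/id URL per gene and, if `fetch_overlaps` is called, requests each in turn. Only the standard library is used, and `feature=variation` is the value that asks the overlap endpoint for variants.

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"


def build_overlap_url(gene_id, feature="variation"):
    """Return the GET overlap/id URL for one Ensembl stable gene ID."""
    query = urllib.parse.urlencode({"feature": feature})
    return f"{SERVER}/overlap/id/{urllib.parse.quote(gene_id)}?{query}"


def fetch_overlaps(gene_ids):
    """Request each gene separately and collect the decoded JSON per gene."""
    results = {}
    for gene_id in gene_ids:
        request = urllib.request.Request(
            build_overlap_url(gene_id),
            headers={"Accept": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            results[gene_id] = json.load(response)
    return results


# Placeholder IDs for illustration; substitute your own list.
gene_ids = ["ENSG00000139618", "ENSG00000069188"]
urls = [build_overlap_url(g) for g in gene_ids]
for url in urls:
    print(url)
```

Calling `fetch_overlaps(gene_ids)` would send one request per gene; for a few hundred genes it may be worth pausing briefly between requests.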

ADD REPLY • link 2.8 years ago by Ben Moore ★ 2.4k

GET overlap takes only Ensembl IDs as input, but I want to supply gene symbols.

ADD REPLY • link 2.8 years ago by anasjamshed ▴ 140

No problem. In that case you'll need to combine this with the POST lookup/symbol endpoint to retrieve the Ensembl stable IDs associated with each gene symbol: http://rest.ensembl.org/documentation/info/symbol_post

A POST endpoint is available in this case, so you can submit all gene symbols in a single query.

ADD REPLY • link 2.8 years ago by Ben Moore ★ 2.4k

Ben, I am trying this code now:

import requests, sys
# List of genes to search for
list1= open("test.txt").read()
# split the file contents on newlines into a list of strings
geneList = list1.rstrip().split("\n")

import requests, sys
server = "https://rest.ensembl.org"
for i in geneList:
    ext = "/lookup/symbol/homo_sapiens/"
    r = requests.post(server+ext, headers={ "Content-Type" : "application/json", "Accept" : "application/json"}, data=str(geneList))
    decoded = r.json()

    if not r.ok:
        r.raise_for_status()
        sys.exit()
    decoded = r.json()
    print(repr(decoded))

But it gives me an error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

ADD REPLY • link 2.8 years ago by anasjamshed ▴ 140

If you are just using the POST lookup endpoint, you don't need to include the loop. Something like this will print the full output for each gene symbol in your list:

import requests, sys, json
from pprint import pprint

def fetch_endpoint(server, request, content_type):

    r = requests.get(server+request, headers={ "Accept" : content_type})

    if not r.ok:
        r.raise_for_status()
        sys.exit()

    if content_type == 'application/json':
        return r.json()
    else:
        return r.text

def fetch_endpoint_POST(server, request, data, content_type):

    r = requests.post(server+request,
                      headers={ "Accept" : content_type},
                      data=data )

    if not r.ok:
        r.raise_for_status()
        sys.exit()

    if content_type == 'application/json':
        return r.json()
    else:
        return r.text

# define the server, extension and content type
server = "http://rest.ensembl.org/"
con = "application/json"
ext = "lookup/symbol/homo_sapiens/"

# create the list of gene symbols
gene_names = ["BRCA2", "ESPN"]

# convert the list into json format
data = json.dumps({ "symbols" : gene_names })

# run the query
post_lookup = fetch_endpoint_POST(server, ext, data, con)

#print the output
pprint (post_lookup)

Module 6 of the Ensembl REST API online course will teach you how to use the POST endpoints: https://www.ebi.ac.uk/training/online/courses/ensembl-rest-api/
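
For comparison with the earlier attempt, here is a small offline sketch of why `json.dumps` is used above rather than `str(geneList)`. Calling `str()` on a Python list yields Python's own repr (single-quoted strings, no `symbols` key), which is not the JSON object built in the example above. Whether this alone explains the JSONDecodeError is an assumption; no request is sent here, and `is_valid_json` is a hypothetical helper.

```python
import json

gene_names = ["BRCA2", "ESPN"]

# What the earlier attempt sent: the Python repr of the list.
as_str = str(gene_names)

# What the working example sends: a JSON object with a "symbols" key.
as_json = json.dumps({"symbols": gene_names})


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(as_str)
print(as_json)
print(is_valid_json(as_str), is_valid_json(as_json))
```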

ADD REPLY • link 2.8 years ago by Ben Moore ★ 2.4k

Thanks. Now I am trying this code:

import requests, sys
# List of genes to search for
list1= open("test.txt").read()
# split the file contents on newlines into a list of strings
geneList = list1.rstrip().split("\n")

import requests, sys, json
from pprint import pprint

def fetch_endpoint(server, request, content_type):

    r = requests.get(server+request, headers={ "Accept" : content_type})

    if not r.ok:
        r.raise_for_status()
        sys.exit()

    if content_type == 'application/json':
        return r.json()
    else:
        return r.text

def fetch_endpoint_POST(server, request, data, content_type):

    r = requests.post(server+request,
                      headers={ "Accept" : content_type},
                      data=data )

    if not r.ok:
        r.raise_for_status()
        sys.exit()

    if content_type == 'application/json':
        return r.json()
    else:
        return r.text

# define the server, extension and content type
server = "http://rest.ensembl.org/"
con = "application/json"
ext = "lookup/symbol/homo_sapiens/"

# create the list of gene symbols
gene_names = geneList

# convert the list into json format
data = json.dumps({ "symbols" : gene_names })

# run the query
post_lookup = fetch_endpoint_POST(server, ext, data, con)

#print the output
pprint (post_lookup)

But it only gives me information about the mapped genes.

I want the SNPs related to my genes using the REST API.

ADD REPLY • link 2.8 years ago by anasjamshed ▴ 140

That's correct. You can use the POST Lookup/symbol endpoint to retrieve the Ensembl stable gene IDs. You will then need to use the list of stable IDs in the GET Overlap/id endpoint (using the loop) to retrieve the variants overlapping your genes of interest.
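
A sketch of the hand-off between the two endpoints, assuming the lookup response is a dictionary keyed by gene symbol whose values each carry an `id` field. The sample response below is abbreviated and hand-written for illustration (the second entry is made up), and `stable_ids_from_lookup` is a hypothetical helper.

```python
def stable_ids_from_lookup(post_lookup):
    """Map each gene symbol to its Ensembl stable ID, skipping entries without one."""
    return {
        symbol: record["id"]
        for symbol, record in post_lookup.items()
        if isinstance(record, dict) and "id" in record
    }


# Abbreviated, hand-written stand-in for a POST lookup/symbol response.
sample_lookup = {
    "BRCA2": {"id": "ENSG00000139618", "display_name": "BRCA2"},
    "GENE_B": {"id": "ENSG_PLACEHOLDER", "display_name": "GENE_B"},  # made-up entry
}

ids = stable_ids_from_lookup(sample_lookup)
for symbol, gene_id in ids.items():
    # Each stable ID becomes one GET overlap/id request path.
    print(symbol, f"/overlap/id/{gene_id}?feature=variation")
```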

ADD REPLY • link 2.8 years ago by Ben Moore ★ 2.4k

Thanks.

I have found the IDs of all the genes with:

import requests, sys
import pandas as pd
# List of genes to search for
list1= open("test.txt").read()
# split the file contents on newlines into a list of strings
geneList = list1.rstrip().split("\n")

import requests, sys, json
from pprint import pprint

def fetch_endpoint(server, request, content_type):

    r = requests.get(server+request, headers={ "Accept" : content_type})

    if not r.ok:
        r.raise_for_status()
        sys.exit()

    if content_type == 'application/json':
        return r.json()
    else:
        return r.text

def fetch_endpoint_POST(server, request, data, content_type):

    r = requests.post(server+request,
                      headers={ "Accept" : content_type},
                      data=data )

    if not r.ok:
        r.raise_for_status()
        sys.exit()

    if content_type == 'application/json':
        return r.json()
    else:
        return r.text

# define the server, extension and content type
server = "http://rest.ensembl.org/"
con = "application/json"
ext = "lookup/symbol/homo_sapiens/"

# create the list of gene symbols
gene_names = geneList

# convert the list into json format
data = json.dumps({ "symbols" : gene_names })

# run the query
post_lookup = fetch_endpoint_POST(server, ext, data, con)

#print the output
#pprint (post_lookup)

# Load the data into a pandas dataframe
genes = pd.DataFrame.from_dict(post_lookup, orient="index")
Ensembl_symbols = genes["id"]

#print(Ensembl_symbols)
Ensembl_symbols.to_csv("IDS.csv")

Now I want to use this list of IDs to fetch the variants. See:

import requests, sys
# List of genes to search for
list1= open("idensembl.txt").read()
# split the file contents on newlines into a list of strings
geneList = list1.rstrip().split("\n")

import requests, sys
server = "https://rest.ensembl.org"
for i in geneList:
    ext = "/overlap/id/"+i+"?feature=genes"
    r = requests.post(server+ext,headers={ "Content-Type" : "application/json"})
    decoded = r.json()

    if not r.ok:
        r.raise_for_status()
        sys.exit()
    decoded = r.json()
    print(repr(decoded))

But it gives me an error:

HTTPError: 400 Client Error: Bad Request for url: https://rest.ensembl.org/overlap/id/ENSG00000069188?feature=genes

How can I solve it?

ADD REPLY • link 2.8 years ago by anasjamshed ▴ 140

To retrieve variants overlapping your genes of interest, you will need to use the feature=variation optional parameter. So, your URL should look like the following: https://rest.ensembl.org/overlap/id/ENSG00000069188?feature=variation

In your script, the extension should look like this:

ext = "/overlap/id/"+i+"?feature=variation"
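
Once that request succeeds, each element of the decoded list describes one variant. As a hedged sketch of reducing those records to a few of the columns asked for in the question: the field names used here (`id`, `seq_region_name`, `start`, `end`, `alleles`, `clinical_significance`, `consequence_type`) are assumed from the overlap endpoint's JSON output and should be checked against a real response. The sample record is hand-written, and `summarise_variant` is a hypothetical helper. Scores such as CADD or REVEL are not covered by this sketch.

```python
def summarise_variant(gene_id, record):
    """Flatten one overlap record into the columns of interest."""
    location = "{}:{}-{}".format(
        record.get("seq_region_name", ""),
        record.get("start", ""),
        record.get("end", ""),
    )
    return {
        "Gene": gene_id,
        "Variant ID": record.get("id", ""),
        "Location": location,
        "Alleles": "/".join(record.get("alleles", [])),
        "Clin. Sig.": ",".join(record.get("clinical_significance", [])),
        "Conseq. Type": record.get("consequence_type", ""),
    }


# Hand-written example record; the values are illustrative, not real data.
sample_record = {
    "id": "rs0000001",
    "seq_region_name": "2",
    "start": 100,
    "end": 100,
    "alleles": ["G", "A"],
    "clinical_significance": [],
    "consequence_type": "intron_variant",
}

row = summarise_variant("ENSG00000069188", sample_record)
print(row)
```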

ADD REPLY • link 2.8 years ago by Ben Moore ★ 2.4k

Thanks, Ben. I have finished my script like this:

import requests, sys
import pandas as pd
# List of genes to search for
list1= open("test.txt").read()
# split the file contents on newlines into a list of strings
geneList = list1.rstrip().split("\n")

import requests, sys, json
from pprint import pprint

def fetch_endpoint(server, request, content_type):

    r = requests.get(server+request, headers={ "Accept" : content_type})

    if not r.ok:
        r.raise_for_status()
        sys.exit()

    if content_type == 'application/json':
        return r.json()
    else:
        return r.text

def fetch_endpoint_POST(server, request, data, content_type):

    r = requests.post(server+request,
                      headers={ "Accept" : content_type},
                      data=data )

    if not r.ok:
        r.raise_for_status()
        sys.exit()

    if content_type == 'application/json':
        return r.json()
    else:
        return r.text

# define the server, extension and content type
server = "http://rest.ensembl.org/"
con = "application/json"
ext = "lookup/symbol/homo_sapiens/"

# create the list of gene symbols
gene_names = geneList

# convert the list into json format
data = json.dumps({ "symbols" : gene_names })

# run the query
post_lookup = fetch_endpoint_POST(server, ext, data, con)

#print the output
#pprint (post_lookup)

# Load the data into a pandas dataframe
genes = pd.DataFrame.from_dict(post_lookup, orient="index")
Ensembl_symbols = genes["id"]
ID= pd.DataFrame(Ensembl_symbols)

EnsID = ID["id"]

for i in EnsID:
    ext = "/overlap/id/"+i+"?feature=variation"
    r = requests.get(server+ext,headers={ "Content-Type" : "application/json"})
    decoded = r.json()

    if not r.ok:
        r.raise_for_status()
        sys.exit()
    decoded = r.json()
    variations = pd.DataFrame(decoded)
    print(variations)
    variations.to_csv("Variations.csv")

and it's giving me 8009 SNPs.

But when I move variations.to_csv("Variations.csv") outside the loop, the number of SNPs drops to 7569. What could be the reason?

ADD REPLY • link 2.8 years ago by anasjamshed ▴ 140

Can you try this within the loop:

variations.to_csv(i+"Variations.csv")

or

variations.to_excel(i+"Variations.xlsx")

Within the loop, I guess you are overwriting the CSV every time the loop runs, so the CSV written outside the loop may contain only the last gene's variants. Remember that each iteration (each gene) can return a different set of columns, in a different order. Be careful when merging them.
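
One way to avoid both the overwriting and the column mismatch is to collect the rows for every gene first and write a single file after the loop. The sketch below uses the standard-library `csv` module rather than pandas, the per-gene records are hand-written stand-ins for the decoded overlap responses, and `write_combined_csv` is a hypothetical helper.

```python
import csv
import io


def write_combined_csv(per_gene_records, handle):
    """Write all variants for all genes to one CSV, tolerating differing columns."""
    rows = []
    fieldnames = ["gene_id"]
    for gene_id, records in per_gene_records.items():
        for record in records:
            row = {"gene_id": gene_id, **record}
            rows.append(row)
            # Keep the union of columns in first-seen order.
            for key in row:
                if key not in fieldnames:
                    fieldnames.append(key)
    # restval fills in columns that a given record does not have.
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Hand-written stand-ins for per-gene overlap results (not real data).
per_gene_records = {
    "GENE_A": [{"id": "rs1", "start": 10}, {"id": "rs2", "start": 20}],
    "GENE_B": [{"start": 30, "id": "rs3", "source": "dbSNP"}],
}

buffer = io.StringIO()
n_rows = write_combined_csv(per_gene_records, buffer)
print(n_rows)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("Variations.csv", "w", newline="")` as `handle`.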

ADD REPLY • link 2.8 years ago by cpad0112 21k