EBI API GWAS query over gene
3 months ago
sim.j.baum ▴ 110

Hi,
I would like to get the SNPs associated with a gene from EBI via an API request in Python. I found this tutorial:

https://colab.research.google.com/github/EBISPOT/GWAS_Catalog-workshop/blob/master/notebooks/workshop_01.ipynb#scrollTo=1rraZwYctEud

It is helpful for getting a trait or SNPs, which is already good! However, I am not very experienced with APIs and do not know how to change this query into a gene query.

This is from the tutorial:

import requests
import pandas as pd

# API Address:
apiUrl = 'https://www.ebi.ac.uk/gwas/summary-statistics/api'


trait = "EFO_0001360"
p_upper = "0.000000001"


requestUrl = '%s/traits/%s/associations?p_upper=%s&size=10' %(apiUrl, trait, p_upper)
response = requests.get(requestUrl, headers={ "Content-Type" : "application/json"})

# The returned response is a "response" object, from which we have to extract and parse the information:
decoded = response.json()
extractedData = []



def getStudy(studyLink):
    # Accessing data for a single study:
    response = requests.get(studyLink, headers={ "Content-Type" : "application/json"})
    decoded = response.json()

    gwasData = requests.get(decoded['_links']['gwas_catalog']['href'], headers={ "Content-Type" : "application/json"})
    decodedGwasData = gwasData.json()

    traitName = decodedGwasData['diseaseTrait']['trait']
    pubmedId = decodedGwasData['publicationInfo']['pubmedId']

    return(traitName, pubmedId)


extractedData = []

for association in decoded['_embedded']['associations'].values():
    pvalue = association['p_value']
    variant = association['variant_id']
    studyID = association['study_accession']
    studyLink = association['_links']['study']['href']
    traitName, pubmedId = getStudy(studyLink)

    extractedData.append({'variant' : variant,
                          'studyID': studyID,
                          'pvalue' : pvalue,
                          'traitName': traitName,
                          'pubmedID': pubmedId}) 


ssWithGWASTable = pd.DataFrame.from_dict(extractedData)
ssWithGWASTable

In the decoded response you get this:

'trait': [{'href': 'https://www.ebi.ac.uk/gwas/summary-statistics/api/traits/EFO_0001360'}],

which I guess is where to change it (maybe with /summary-statistics/gene/... ?). But I am not really good with APIs and hope to get some pointers or solutions here.

Thanks, Simon

GWAS API EBI Python • 534 views
3 months ago
Mihai Todor ▴ 30

I could be wrong, but I think you might be able to achieve this using the eQTL RESTful API as described here.

For example, for gene TP53BP1 (ENSG00000067369), here are the first two results:

curl -s "https://www.ebi.ac.uk/eqtl/api/genes/ENSG00000067369/associations?size=2" | jsonpp
{
  "_embedded": {
    "associations": {
      "0": {
        "qtl_group": "macrophage_IFNg",
        "se": 0.0312781,
        "beta": 0.0306393,
        "median_tpm": 3.798,
        "study_id": "Alasoo_2018",
        "neg_log10_pvalue": 0.4806052046189824,
        "rsid": "rs4573906",
        "chromosome": "15",
        "type": "SNP",
        "alt": "A",
        "position": 42512802,
        "ac": 124.0,
        "maf": 0.255952,
        "variant": "chr15_42512802_G_A",
        "ref": "G",
        "pvalue": 0.33067,
        "r2": 0.81584,
        "an": 168.0,
        "molecular_trait_id": "ENSG00000067369",
        "gene_id": "ENSG00000067369",
        "tissue": "CL_0000235"
      },
      "1": {
        "qtl_group": "macrophage_IFNg",
        "se": 0.0312826,
        "beta": 0.0307136,
        "median_tpm": 3.798,
        "study_id": "Alasoo_2018",
        "neg_log10_pvalue": 0.4820470569928696,
        "rsid": "rs5812225",
        "chromosome": "15",
        "type": "INDEL",
        "alt": "C",
        "position": 42514003,
        "ac": 125.0,
        "maf": 0.255952,
        "variant": "chr15_42514003_CG_C",
        "ref": "CG",
        "pvalue": 0.329574,
        "r2": 0.81564,
        "an": 168.0,
        "molecular_trait_id": "ENSG00000067369",
        "gene_id": "ENSG00000067369",
        "tissue": "CL_0000235"
      }
    }
  },
  "_links": {
    "self": {
      "href": "http://www.ebi.ac.uk/eqtl/api/genes/ENSG00000067369/associations?links=False"
    },
    "first": {
      "href": "http://www.ebi.ac.uk/eqtl/api/genes/ENSG00000067369/associations?size=2&links=False&start=0"
    },
    "next": {
      "href": "http://www.ebi.ac.uk/eqtl/api/genes/ENSG00000067369/associations?size=2&links=False&start=2"
    }
  }
}

Note that the API is paginated, so you'll need to specify the number of results to return on each page (the default is 20) and the link to the next page is returned in _links.next.href. There are more query parameters that you can tweak, such as p_lower and p_upper.
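In Python, following the pagination could look roughly like the sketch below. It only relies on the `_embedded.associations` and `_links.next.href` fields visible in the response above. The names `fetch_json`, `iter_associations` and `max_pages` are made up for illustration, and the standard library is used for the HTTP call (swapping in `requests.get(url).json()` should work just as well).

```python
import json
import urllib.request


def fetch_json(url):
    """GET a URL and return the decoded JSON body (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


def iter_associations(start_url, fetch=fetch_json, max_pages=5):
    """Yield association records page by page.

    Follows _links.next.href until there is no next link or until
    max_pages pages have been read (max_pages is an illustrative safety cap).
    """
    url = start_url
    for _ in range(max_pages):
        page = fetch(url)
        # Associations come back as a dict keyed by "0", "1", ...
        associations = page.get("_embedded", {}).get("associations", {})
        yield from associations.values()
        next_link = page.get("_links", {}).get("next")
        if not next_link:
            break
        url = next_link["href"]
```

For the example gene you would start it with `iter_associations("https://www.ebi.ac.uk/eqtl/api/genes/ENSG00000067369/associations?size=2")` and read, for instance, `record["rsid"]` and `record["pvalue"]` from each yielded record.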


Mihai! Thank you so much. I know I am asking a lot here, but did you figure out how to do this in Python, and how to solve the trouble Python had with the HTTP request format?


Right, so that ended up being just a few lines of code for the undocumented https://www.ebi.ac.uk/gwas/api/search/advancefilter API, but I think it's not a good idea to use it because the output JSON is huge and its schema isn't clear. It also happens to stream multiple top-level JSON objects, so you'll probably need an advanced parser to deal with the returned data. Basically, you get something like this: {...}{...}...{...}, where each {...} is an independent JSON document.

Here's how to call it for the same example gene:

import requests

endpoint = "https://www.ebi.ac.uk/gwas/api/search/advancefilter"

qPayload = 'ensemblMappedGenes: "TP53BP1" OR association_ensemblMappedGenes: "TP53BP1"'

headers = {'Content-Type': 'application/x-www-form-urlencoded'}

response = requests.post(endpoint, headers=headers, data=[('q', qPayload)])

print(response.content)
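If you do want to read that {...}{...}...{...} stream, the standard library can split it without a special parser. This is a minimal sketch, assuming the documents are plain JSON objects written back to back (optionally separated by whitespace); `iter_json_documents` is a name I made up for it.

```python
import json


def iter_json_documents(text):
    """Yield each top-level JSON document from a concatenated stream."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # Skip any whitespace between two documents.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode parses one document and returns the index where it stopped.
        document, pos = decoder.raw_decode(text, pos)
        yield document
```

With the request above this would be used as `for doc in iter_json_documents(response.text): ...`, so each independent document can be inspected on its own.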

For the documented API I mentioned above, things are much easier, although the API seems a bit slow to respond, since it takes several seconds before you get the JSON data back. However, you do get a single JSON object in the response and, like I mentioned above, the pagination works too.

import requests

endpoint = "https://www.ebi.ac.uk/eqtl/api/genes/ENSG00000067369/associations"

params = {'size': '2'}

response = requests.get(endpoint, params=params)

print(response.content)

Hi Mihai,

I had a look at the output. It works, but the output is related to eQTLs. That is close, but it's not the (disease) traits associated with a gene. I was naive enough to try this approach:

import requests

endpoint = "https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/search"

params = {'size': '3', "geneName":"ENSG00000106633"}

response = requests.get(endpoint, params=params)

print(response.content)

I tried that after looking at this documentation: https://www.ebi.ac.uk/gwas/rest/docs/api#_example_request_2

https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/search/findByGene{?geneName,page,size,sort,projection}

The output was kind of spaghetti. Any pointers?


Hey Simon, sorry about that... It did seem like it was too easy to be correct :)

I looked into that API endpoint, but I think it's not returning what you want (see below)... Maybe this code I just found does the job? https://github.com/KatrionaGoldmann/omicAnnotations/blob/5d0a4dfda6ca55349408f2c6ee0792a02004696f/R/associated_publications.R I'm not skilled enough at R, but it looks like the heavy lifting is done by this package: https://cran.r-project.org/web/packages/easyPubMed/index.html

What I was able to do with that endpoint you provided is:

import json
import requests

# curl -s "https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/search/findByGene?geneName=GCK&size=100" | jq -r '._embedded.singleNucleotidePolymorphisms[].rsId'

endpoint = "https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/search/findByGene"

params = {'size': '100', 'geneName': 'GCK'}

response = requests.get(endpoint, params=params)

# print(response.content)

snps = json.loads(response.text)['_embedded']['singleNucleotidePolymorphisms']

rsIds = [snp['rsId'] for snp in snps]

print(rsIds)

So, I have been in contact with EBI and those nice guys forwarded me this idea:

import pandas as pd

df = pd.read_table('https://www.ebi.ac.uk/gwas/api/search/downloads/alternative')
gene = 'TP53'
df[df['MAPPED_GENE'].str.contains(r'(^|.*[\s,-]){gene}($|[\s,].*)'.format(gene=gene), na=False)]
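One caveat with the pattern above: because it contains capture groups, pandas emits a warning about match groups in `str.contains`, and the gene name is not escaped. A variant with the same boundary rules (start of string or whitespace/comma/hyphen before the gene, end of string or whitespace/comma after it) could be sketched like this. `gene_pattern` and `mentions_gene` are hypothetical helper names, and the separator rules are simply taken over from the pattern above, not checked against the catalog documentation.

```python
import re


def gene_pattern(gene):
    """Build a regex that matches `gene` only as a whole entry.

    Lookarounds replace the capture groups, and re.escape protects
    gene names that contain regex metacharacters.
    """
    return r"(?<![^\s,-])" + re.escape(gene) + r"(?![^\s,])"


def mentions_gene(mapped_gene, gene):
    """Return True if a MAPPED_GENE cell lists `gene`; non-strings (NaN) give False."""
    if not isinstance(mapped_gene, str):
        return False
    return re.search(gene_pattern(gene), mapped_gene) is not None
```

On the DataFrame this would be `df[df['MAPPED_GENE'].str.contains(gene_pattern(gene), na=False)]`. A search for TP53 then keeps rows like "WRAP53, TP53" but not "TP53BP1".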

Another idea was this one in bash (but it was not reproducible for some genes):

gene="TP53"
curl -s "https://www.ebi.ac.uk/gwas/api/search/downloads/alternative" | head -n1 > ${gene}.tsv; curl -s "https://www.ebi.ac.uk/gwas/api/search/downloads/alternative" |  awk -F "\t" -v g="$gene" '$15==g' >> ${gene}.tsv

Mihai, can you please add that part to your answer above, so that I can accept your answer (which I already did)?


Hehe, of course they have some other API... This one is a bit nasty, because it doesn't let you filter on the gene name before download, but, luckily, the downloaded file is just 161MB.

I got this to work for example for gene TP53BP1:

curl -s https://www.ebi.ac.uk/gwas/api/search/downloads/alternative | grep "$(printf '\tTP53BP1\t')" | awk -F '\t' '{ print $21 " - " $6 }'

And with Python:

import pandas as pd

gene = 'TP53BP1'

pd.set_option('display.max_rows', None)

df = pd.read_table('https://www.ebi.ac.uk/gwas/api/search/downloads/alternative', dtype='unicode')

associations = df.loc[df['MAPPED_GENE'] == gene]

print(associations[['STRONGEST SNP-RISK ALLELE', 'LINK']])

You could download the file once and then load it from disk instead of fetching it each time you run this script.
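A rough sketch of that download-once idea, using only the standard library for the download. `cached_download` and the local file name are hypothetical; the URL is the one used above.

```python
import os
import urllib.request

CATALOG_URL = "https://www.ebi.ac.uk/gwas/api/search/downloads/alternative"


def cached_download(url, path):
    """Return `path`, downloading `url` into it only if the file is missing."""
    if not os.path.exists(path):
        # Download to a temporary name first so an interrupted
        # transfer does not leave a truncated cache file behind.
        tmp_path = path + ".part"
        urllib.request.urlretrieve(url, tmp_path)
        os.replace(tmp_path, path)
    return path
```

The script above would then load the data with `pd.read_table(cached_download(CATALOG_URL, 'gwas_catalog.tsv'), dtype='unicode')`. Delete the local file whenever you want a fresh copy of the catalog.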
