Question

Querying VEP using REST API

0

5.5 years ago

Arko ▴ 30

I'd like to get JSON output from VEP using the REST API. I have the following code in Python 3. Can anyone tell me where I'm going wrong? I'm getting an HTTP 400 error.

import json
import requests
import sys
with open('sample.tsv','r') as file:
for line in file:
if line.startswith("rs"):
snps = line.split()[0]
for i in snps :
server = "https://rest.ensembl.org"
ext = "/vep/human/id"
headers = { "Content-Type" : "application/json", "Accept" : "application/json"}
r = requests.post(server+ext, headers=headers, data='{ "ids" : [i] }')

if not r.ok:
r.raise_for_status()
sys.exit()

decoded = r.json()
print(repr(decoded))

This is pretty basic, I just want to get multiple outputs at the same time, instead of one by one. Any suggestions?

I wouldn't mind doing it using R either, if that's a good alternative. Thanks!

REST VEP Ensembl python R • 3.5k views

ADD COMMENT • link updated 5.5 years ago by Emily 23k • written 5.5 years ago by Arko ▴ 30

0

Can you please print i here? What are you querying?

ADD REPLY • link 5.5 years ago by cpad0112 21k

0

rs2329763 rs6716521 rs4686605 rs9290916 rs7622109 rs16877127 rs160885 rs16908004 rs622120 rs970275 . . 1000 rows more

ADD REPLY • link 5.5 years ago by Arko ▴ 30

2

import requests

# Read one rsID per line, dropping trailing newlines and blank lines
with open('test.txt','r') as f:
    test = [line.strip() for line in f if line.strip()]

server = "https://rest.ensembl.org"
for i in test:
    # One GET request per rsID
    ext = "/variant_recoder/human/"+i
    r = requests.get(server+ext, headers={ "Content-Type" : "application/json"})
    decoded = r.json()
    print(repr(decoded))

output:

[{'id': ['rs56116432'], 'input': 'rs56116432', 'hgvsp': ['ENSP00000483018.1:p.Gly229Asp', 'ENSP00000483265.1:p.Gly229Asp', 'ENSP00000487108.2:p.Gly230Asp', 'ENSP00000494079.1:p.Gly230Asp', 'ENSP00000494984.1:p.Gly230Asp', 'ENSP00000496236.1:p.Gly230Asp'], 'hgvsc': ['ENST00000453660.4:n.718G>A', 'ENST00000538324.2:c.686G>A', 'ENST00000611156.4:c.686G>A', 'ENST00000647353.1:n.54-4890G>A', 'ENST00000626615.2:c.689G>A', 'ENST00000644422.1:c.689G>A', 'ENST00000644755.1:c.689G>A', 'ENST00000645810.1:c.689G>A'], 'hgvsg': ['NC_000009.12:g.133256042C>T', 'CHR_HG2030_PATCH:g.133256189C>T']}]
[{'id': ['rs56116431'], 'input': 'rs56116431', 'hgvsc': ['ENST00000274498.9:c.1539-23605del', 'ENST00000378004.8:c.1539-23605del', 'ENST00000418236.5:c.254-23605del', 'ENST00000443674.5:c.395-23605del', 'ENST00000642734.1:c.1431-23605del', 'ENST00000645722.1:c.1539-23605del', 'LRG_1127t1:c.1539-23605del'], 'hgvsg': ['NC_000005.10:g.143097383del', 'LRG_1127:g.332014del']}]

input:

$ cat test.txt 
rs56116432
rs56116431

ADD REPLY • link 5.5 years ago by cpad0112 21k

0

One question: do you have a good way of parsing or formatting the JSON output? Right now I'm outputting everything into a text file.

ADD REPLY • link 5.5 years ago by Arko ▴ 30

0

Sure. What kind of output are you expecting?

ADD REPLY • link 5.5 years ago by cpad0112 21k

0

All fields into a tsv or a csv format preferably. Something that's easy to read. Right now it's a mess when outputting into a text file directly.

ADD REPLY • link 5.5 years ago by Arko ▴ 30

1

I dump the JSON to a file and then parse the Ensembl output offline with jq (standalone). I generally extract the g., c. and p. syntax and the calculated effect. There are also JSON libraries in all languages that can parse it the way you want.
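For a TSV like the one asked for above, here is a minimal Python sketch using only the standard library. It assumes the decoded response is a list of variant records shaped like the VEP output shown elsewhere in this thread. The column selection and the helper names `vep_rows` and `write_tsv` are illustrative choices, not part of the Ensembl API, and the sample data is a hand-trimmed excerpt of that output.

```python
import csv
import io

# Columns to export; an arbitrary selection for illustration.
COLUMNS = ["input", "most_severe_consequence", "gene_symbol",
           "transcript_id", "impact", "consequence_terms"]

def vep_rows(decoded):
    """Yield one flat dict per (variant, transcript consequence) pair."""
    for variant in decoded:
        # Variants without transcript consequences still get one row.
        consequences = variant.get("transcript_consequences") or [{}]
        for tc in consequences:
            yield {
                "input": variant.get("input", ""),
                "most_severe_consequence": variant.get("most_severe_consequence", ""),
                "gene_symbol": tc.get("gene_symbol", ""),
                "transcript_id": tc.get("transcript_id", ""),
                "impact": tc.get("impact", ""),
                # Lists are joined so that each cell holds a single string.
                "consequence_terms": ",".join(tc.get("consequence_terms", [])),
            }

def write_tsv(decoded, handle):
    """Write the flattened rows to an open text handle as tab-separated values."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(vep_rows(decoded))

# Hand-trimmed sample in the shape of the VEP response (not a full record).
decoded = [{
    "input": "rs56116432",
    "most_severe_consequence": "missense_variant",
    "transcript_consequences": [
        {"gene_symbol": "ABO", "transcript_id": "ENST00000453660",
         "impact": "MODIFIER",
         "consequence_terms": ["non_coding_transcript_exon_variant"]},
        {"transcript_id": "ENST00000538324", "impact": "MODERATE",
         "consequence_terms": ["missense_variant"]},
    ],
}]

buffer = io.StringIO()
write_tsv(decoded, buffer)
print(buffer.getvalue())
```

To write a file instead, pass a handle opened with `open("vep.tsv", "w", newline="")` in place of the `StringIO` buffer. Keys that are missing from a record come out as empty cells.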

ADD REPLY • link 5.5 years ago by cpad0112 21k

0

Please can you indent your script properly so that other people can run it.

ADD REPLY • link 5.5 years ago by Emily 23k

0

5.5 years ago

Pierre Lindenbaum 161k

data='{ "ids" : [i] }')

There is no such ids parameter in the documentation: https://rest.ensembl.org/documentation/info/vep_id_get

ADD COMMENT • link 5.5 years ago by Pierre Lindenbaum 161k

1

They're using the POST endpoint

ADD REPLY • link 5.5 years ago by Emily 23k

score 4 · Accepted Answer · 2018-10-24

4

5.5 years ago

Emily 23k

You need to create a JSON dump of your list of IDs. You can then pass the whole list to the endpoint in a single request, so you don't have to go through the list with a for loop.

Assuming you've created the array snps which contains a list of rsIDs:

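import json
import requests
import sys

# snps is assumed to be a list of rsID strings, e.g. ["rs56116432", "rs56116431"]
data = json.dumps({ "ids" : snps })
server = "https://rest.ensembl.org"
ext = "/vep/human/id"
headers = { "Content-Type" : "application/json", "Accept" : "application/json"}
r = requests.post(server+ext, headers=headers, data=data)

# Stop with the HTTP error if the request failed
if not r.ok:
    r.raise_for_status()
    sys.exit()

decoded = r.json()
print(repr(decoded))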

This will query everything at once.
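To illustrate why the JSON dump matters, here is a small sketch comparing the literal string from the question with the output of `json.dumps`. The two rsIDs are the ones used as test input elsewhere in this thread. The check only shows that the literal is not valid JSON; it does not claim to reproduce exactly how the server handles it.

```python
import json

snps = ["rs56116432", "rs56116431"]
i = snps[0]

# Inside a string literal, the variable name i is just the character "i";
# Python does not substitute its value.
literal = '{ "ids" : [i] }'
try:
    json.loads(literal)
    literal_is_valid = True
except json.JSONDecodeError:
    literal_is_valid = False

# json.dumps serialises the actual list of IDs into a valid JSON body.
payload = json.dumps({"ids": snps})

print(literal_is_valid)  # False
print(payload)           # {"ids": ["rs56116432", "rs56116431"]}
```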

ADD COMMENT • link 5.5 years ago by Emily 23k

0

Doesn't seem to work for me; I'm receiving an HTTP 400 error. I've tried using a list as well as just a string with each rsID on a new line.

ADD REPLY • link 5.5 years ago by Arko ▴ 30

1

Try this. The input and output are given below. Code:

import json
import requests
import sys

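with open('test.txt','r') as f:
    # Strip trailing newlines and skip blank lines so each ID is a clean rsID
    test=[line.strip() for line in f if line.strip()]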

data = json.dumps({ "ids" : test })
server = "https://rest.ensembl.org"
ext = "/vep/human/id"
headers = { "Content-Type" : "application/json", "Accept" : "application/json"}
r = requests.post(server+ext, headers=headers, data=data)

if not r.ok:
    r.raise_for_status()
    sys.exit()

decoded = r.json()
print(repr(decoded))

input:

$ cat test.txt 
rs56116432
rs56116431

output:

[{'colocated_variants': [{'allele_string': 'C/T', 'frequencies': {'T': {'afr': '0', 'gnomad_fin': '0.01363', 'gnomad': '0.003639', 'gnomad_afr': '0.0006606', 'eas': '0', 'gnomad_amr': '0.002336', 'ea': '0.003809', 'amr': '0.0014', 'gnomad_sas': '0.001334', 'gnomad_nfe': '0.003593', 'eur': '0.0109', 'sas': '0.001', 'gnomad_asj': '0.002471', 'gnomad_oth': '0.00628', 'gnomad_eas': '0', 'aa': '0.0007102'}}, 'start': '133256042', 'end': '133256042', 'strand': '1', 'minor_allele': 'T', 'seq_region_name': '9', 'minor_allele_freq': '0.0026', 'id': 'rs56116432'}], 'id': 'rs56116432', 'end': 133256042, 'seq_region_name': '9', 'start': 133256042, 'assembly_name': 'GRCh38', 'input': 'rs56116432', 'most_severe_consequence': 'missense_variant', 'allele_string': 'C/T', 'strand': 1, 'transcript_consequences': [{'gene_id': 'ENSG00000175164', 'gene_symbol_source': 'HGNC', 'gene_symbol': 'ABO', 'cdna_end': 718, 'hgnc_id': 'HGNC:79', 'transcript_id': 'ENST00000453660', 'cdna_start': 718, 'impact': 'MODIFIER', 'biotype': 'processed_transcript', 'consequence_terms': ['non_coding_transcript_exon_variant'], 'variant_allele': 'T', 'strand': -1}, {'strand': -1, 'consequence_terms': ['missense_variant'], 'cdna_end': 711, 'gene_symbol_source': 'HGNC', 'gene_id': 'ENSG00000175164', 'hgnc_id': 'HGNC:79', 'amino_acids': 'G/D', 'transcript_id': 'ENST00000538324', 'cdna_start': 711, 'protein_start': 229, 'variant_allele': 'T', 'codons': 'gGc/gAc', 'impact': 'MODERATE', 'biotype': 'protein_coding', 'sift_prediction': 'deleterious', '
        ============  removed text due to 5000 character limit===================
  'transcript_id': 'ENST00000642734', 'consequence_terms': ['intron_variant'], 'impact': 'MODIFIER', 'biotype': 'protein_coding', 'variant_allele': '-', 'strand': 1}, {'hgnc_id': 'HGNC:17073', 'gene_symbol': 'ARHGAP26', 'gene_id': 'ENSG00000145819', 'gene_symbol_source': 'HGNC', 'transcript_id': 'ENST00000645722', 'variant_allele': '-', 'strand': 1, 'impact': 'MODIFIER', 'biotype': 'protein_coding', 'consequence_terms': ['intron_variant']}], 'strand': 1, 'allele_string': 'A/-', 'id': 'rs56116431', 'colocated_variants': [{'id': 'rs56116431', 'seq_region_name': '5', 'strand': '1', 'end': '143097383', 'start': '143097383', 'allele_string': 'A/-'}], 'end': 143097383, 'start': 143097383, 'seq_region_name': '5'}]

ADD REPLY • link 5.5 years ago by cpad0112 21k

0

How long is your list? The endpoint has a limit of 200 variants, so if it's longer you may need to chunk it.
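A minimal sketch of such chunking in Python, assuming the 200-variant limit mentioned here. The names `chunked`, `query_in_chunks` and `send_to_ensembl` are hypothetical helpers, not part of any Ensembl client, and the request itself mirrors the POST call used in the answer above.

```python
import json

def chunked(items, size):
    """Yield successive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def query_in_chunks(snps, send, chunk_size=200):
    """Send snps in batches and concatenate the decoded results.

    send is any callable that takes a JSON string and returns a list.
    """
    results = []
    for batch in chunked(snps, chunk_size):
        results.extend(send(json.dumps({"ids": batch})))
    return results

def send_to_ensembl(payload):
    """POST one JSON payload to the VEP id endpoint and return the decoded JSON."""
    import requests  # imported here so the helpers above work without it

    r = requests.post(
        "https://rest.ensembl.org/vep/human/id",
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        data=payload,
    )
    r.raise_for_status()  # raise on any HTTP error, e.g. 400
    return r.json()
```

With a list of rsIDs in `snps`, `decoded = query_in_chunks(snps, send_to_ensembl)` would then return one combined list. Passing the sender as an argument keeps the batching logic testable without network access.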

ADD REPLY • link 5.5 years ago by Emily 23k

0

That makes a lot of sense: it has 5000 variants. Chunk it into separate queries? It sounds time-consuming to run 25+ chunks. Would you have a quicker or more efficient way of going about this?

ADD REPLY • link 5.5 years ago by Arko ▴ 30

1

I would use the VEP script and run it all locally, but given that each query should take less than a second, 25+ of them is still going to be pretty quick.

ADD REPLY • link 5.5 years ago by Emily 23k

0

What does this look like if I want to run the API in R? (i.e. json.dumps() doesn't work?).

ADD REPLY • link 4.4 years ago by jbox88 • 0

1

You can make JSON in R using a library called jsonlite. Assuming that you have a list of values in a vector called snps, as in the above example, you could use:

library(jsonlite)
data <- toJSON(list(ids=snps))

There is a full online course using Jupyter notebooks available in R, Python and Perl with all the libraries and code examples you need.