Question

Extract information from Ensembl

0

Entering edit mode

2.9 years ago

bouchenak.chuxi • 0

Hello everyone,

i'm asked to automate a process including SNP research on Ensembl Web site. The process was set on couple peptides , now we have to work on ~ 500 peptides.

The pipeline included :

1 - BLAST (done locally with command line), from which we get ''accession number" of 2 top Best Hits (BH)

2 - this accession numbers are passed on Ensembl to study their variations (variant table in the image)

3 - Results: for the table below, i need to extract the following information/columns for all peptides ( 500 (initial pep) 2 (BH) = 1000* ) : Global MAF , Conseq. Type , AA and Transcripts

Any idea how i can do that ? i heard APIs or Web scraping may do that , any help please !

Screenshot of manual research result

SNP automation python Ensembl • 2.2k views

ADD COMMENT • link updated 2.9 years ago by Michael 54k • written 2.9 years ago by bouchenak.chuxi • 0

0

Entering edit mode

Please, please, please don't do web-scraping.

ADD REPLY • link 2.9 years ago by Emily 23k

0

Entering edit mode

2.9 years ago

Michael 54k

That is definitely a BioMart task. No way screen-scraping is going to work here.

On the Ensembl site, click BioMart, then design your query in the query builder. Once you are happy with the query, you can export it into a Perl script. You can modify the perl script to use with different accessions to use on the fly, or simply define a filter for gene ids via the web interface.

Edit:

Did a little experiment using the biomaRt package in R. And it looks it is feasible to download the data, if the query is crafted carefully. Here, I am just taking 10 gene. The query time might depend mostly on the number of variants per gene, which can vary every broadly. At this rate, downloading all variants for 1000 genes would take approximately 2 hours. The advantage of this approach is that it requires little if any parsing and less network traffic. If there are still concerns of overusing the server, one could break down the whole job into even smaller packages and distribute over a few hours.

library(biomaRt)
ensembl.mart <- useEnsembl(biomart = "ensembl")
d <- useDataset(mart = ensembl.mart, dataset = "hsapiens_gene_ensembl")
system.time( gene.ids <- unlist(getBM(mart = d, attributes = c("ensembl_gene_id" )  )))

# get variants one gene at a time not to overload the server
get.one <- function(gene.id) {
    getBM(mart = d, attributes = c(
        "ensembl_gene_id", 
        "variation_name", 
        "minor_allele_freq", 
        "allele" ), filters = list(ensembl_gene_id=gene.id)) 
}
res <- NULL
system.time( 
    for (i in 1:10) { 
        tmp <- get.one(sample(gene.ids, 1))
        print (paste("got", nrow(tmp), "results" ))
        if (nrow(tmp) > 0) {
            res <- rbind(res, tmp)
        }
        Sys.sleep(5) # Pause a while not to overuse the server
    }
)
dim(res)
object.size(res)

Output:

[1] "got 10069 results"
[1] "got 3090 results"
[1] "got 7685 results"
[1] "got 4224 results"
[1] "got 3 results"
[1] "got 3124 results"
[1] "got 1 results"
[1] "got 1 results"
[1] "got 1 results"
[1] "got 4816 results"
   user  system elapsed 
  2.721   0.265  68.091 
> dim(res)
[1] 33014     4
> object.size(res)
3300368 bytes

ADD COMMENT • link 2.9 years ago by Michael 54k

0

Entering edit mode

Please, please, please, don't use BioMart for this. It is too big a task for BioMart to handle and you'll jam up our servers.

ADD REPLY • link 2.9 years ago by Emily 23k

0

Entering edit mode

Thanks Micheal for your suggestion. I was actually trying to figure out how to work that on my queries datasete until Emily's intervention.

ADD REPLY • link 2.9 years ago by bouchenak.chuxi • 0

0

Entering edit mode

Hi, sorry for that. I was of the conviction that BioMart was the most efficient tool for retrieving large scale data. Unfortunately, there are too many variants, so variants are a special case. Interestingly, it is not even easy to find out how many there are in single gene. So if you query 1000 genes, you have to expect millions of variants. You should definitely go with the recommendation from Ensembl staff, but I am not sure you will be successfully scale that up to thousands of genes.

ADD REPLY • link 2.9 years ago by Michael 54k

0

Entering edit mode

Exactly, there's hug informations to get, we are thinking of a way to filter our needs in order to facilitate the research.

Thanks for your intervention Michael :) .

ADD REPLY • link 2.9 years ago by bouchenak.chuxi • 0

0

Entering edit mode

I think I got a feasible solution here, please see the edit.

ADD REPLY • link 2.9 years ago by Michael 54k

score 2 · Accepted Answer · 2021-05-25

2

Entering edit mode

2.9 years ago

Emily 23k

I would recommend the Ensembl REST API with the overlap/id endpoint, for example:

https://rest.ensembl.org/overlap/id/ENSG00000163810?feature=variation;content-type=application/json

Script around this in your preferred programming language to go through your input list and pull out the datapoints you need. There's an online course to get you started in Python or R if you need it.

ADD COMMENT • link 2.9 years ago by Emily 23k

0

Entering edit mode

This is what I had planned to do at the first place, but since I never used APIs before, i was confused about which one to chose for SNPs reseach .. especially that as you can see in the picture for each ID, we get a hug number of SNPs (scrolling barre at the fac right )

So you recommend that i use the " overlap/id " where i put the accession number as ID and filter the output in order to get only the information's mentioned in my post above, is it right?

I'll take the APIs course right away to better understand the process. Thank you Emily.

ADD REPLY • link 2.9 years ago by bouchenak.chuxi • 0

0

Entering edit mode

Hello again Emily,

Your course was of major help for me thank you so much, it's exactly what i was looking for .

I did the exercises of the Jupyter Notebook. i have a little problem When i run the code on Arabidopsis thaliana gene AT3G52430, i get the results , as shown below

However, when i do that on homo sapien gene XP_006719985.1 , i get this error :

HTTPError: 400 Client Error: Bad Request for url: http://rest.ensembl.org//overlap/id/ENST00000620763.1?feature=variation

1st code

import requests, sys, json
from pprint import pprint

def fetch_endpoint(server, request, content_type):

    r = requests.get(server+request, headers={ "Accept" : content_type})

    if not r.ok:
        r.raise_for_status()
        sys.exit()

    if content_type == 'application/json':
        return r.json()
    else:
        return r.text

print ("Variant\tSource\tAlleles")

# define the general URL parameters
server = "http://rest.ensembl.org/"
ext_overlap = "/overlap/id/AT3G52430?feature=variation"
con = "application/json"

# submit the query
get_overlap = fetch_endpoint(server, ext_overlap, con)

pprint (get_overlap)

# for variant in get_overlap:
#     id = variant['id']
#     source = variant['source']
#     alleles = ', '.join(variant['alleles'])

#     print (id + "\t" + source + "\t" + alleles)

{'alleles': ['G', 'T'], 'assembly_name': 'TAIR10',
'clinical_significance': [], 'consequence_type': 'missense_variant', 'end': 19433465, 'feature_type': 'variation', 'id': 'ENSVATH00415810', 'seq_region_name': '3', 'source': 'The 1001 Genomes Project', 'start': 19433465, 'strand': 1},

2nd code

import requests, sys, json
from pprint import pprint



r = requests.get(server+request, headers={ "Accept" : content_type})

if not r.ok:
    r.raise_for_status()
    sys.exit()

if content_type == 'application/json':
    return r.json()
else:
    return r.text
# define the general URL parameters
server = "http://rest.ensembl.org/"
ext_overlap = "/overlap/id/ENST00000620763.1?feature=variation"
con = "application/json"

# submit the query
get_overlap = fetch_endpoint(server, ext_overlap, con)

pprint (get_overlap)

# for variant in get_overlap:
#     id = variant['id']
#     source = variant['source']
#     alleles = ', '.join(variant['alleles'])

#     print (id + "\t" + source + "\t" + alleles)`

--------------------------------------------------------------------------- HTTPError Traceback (most recent call last) <ipython-input-11-d5dde983fc74> in <module>() 23 24 # submit the query ---> 25 get_overlap = fetch_endpoint(server, ext_overlap, con) 26 27 pprint (get_overlap)

1 frames /usr/local/lib/python3.7/dist-packages/requests/models.py in raise_for_status(self) 939 940 if http_error_msg: --> 941 raise HTTPError(http_error_msg, response=self) 942 943 def close(self):

HTTPError: 400 Client Error: Bad Request for url: http://rest.ensembl.org//overlap/id/ENST00000620763.1?feature=variation

ADD REPLY • link 2.9 years ago by bouchenak.chuxi • 0

0

Entering edit mode

Try it without the version number on the transcript ID, eg:

ext_overlap = "/overlap/id/ENST00000620763?feature=variation"

ADD REPLY • link 2.9 years ago by Emily 23k

0

Entering edit mode

Hi i did that, it worked perfectly, i could then manage that to obtain the transcript of interest according to its version number.
There juste some datapoints that i can't find : Global MAF (corresponding to allele frequency) , AA and AA cord

this is the list of features available in the documentation (link to doc) : Enum(band, gene, transcript, cds, exon, repeat, simple, misc, variation, somatic_variation, structural_variation, somatic_structural_variation, constrained, regulatory, motif, chipseq, array_probe).

I tried to run them all and look for that info, i think I'm missing it somehow ..

ext_overlap = "overlap/id/AT3G52430?feature=variation;feature=transcript"

variation,
{'alleles': ['G', 'A'], 'assembly_name': 'GRCh38', 'clinical_significance': [], concequence_type': 'missense_variant', 'end': 54333651, 'feature_type': 'variation', 'variant_id': 'rs782739360', 'seq_region_name': 'X', 'source': 'dbSNP', 'start': 54333651, 'strand': 1}]

transcript,
{'Parent': 'ENSG00000234566', 'assembly_name': 'GRCh38', 'biotype': 'processed_pseudogene', 'description': None, 'end': 54224062, 'external_name': 'RPL7AP71-201', 'feature_type': 'transcript', 'id': 'ENST00000451120', 'logic_name': 'havana_homo_sapiens', 'seq_region_name': 'X', 'source': 'havana', 'start': 54223324, 'strand': 1, 'tag': 'basic', 'transcript_id': 'ENST00000451120', 'transcript_support_level': 'NA', 'version': 1}]

ADD REPLY • link 2.9 years ago by bouchenak.chuxi • 0

0

Entering edit mode

You may need to put the variants into the VEP endpoint to get those.

ADD REPLY • link 2.9 years ago by Emily 23k

0

Entering edit mode

Hi, I am still interested to know how this is scaling up to 1000 genes and how this compares to a BioMart query in both data volume and backend load. If I understand correctly, the output given here contains the variants for a single gene and the format leads to a lot of additional data transfer.

ADD REPLY • link 2.9 years ago by Michael 54k