Question: Python Bioservices doesn't retrieve the right data from biomart
3
gravatar for rs
4.2 years ago by
rs40
United Kingdom
rs40 wrote:

Hi,

I have a pipeline where I would previously go on ensembl biomart and retrieve data but would now like to automatize a bit more and get my data from biomart directly in a Python script using the Bioservices package. My initial input is a list of RefSeq mRNA identifiers and then I use biomart to retrieve the corresponding ensembl gene identifiers. This works fine. Then I want to give these ensembl identifiers to biomart and to get the corresponding CDS sequences, accompanied by the ensembl gene ID, the ensembl transcript ID and the chromosomal start positions of exons*. At this point, I expect to get back a bit less than 2000 sequences (because this is what I get when I do this manually through my browser). In actuality, I only get 8 sequences and they are accompanied not by ensembl IDs but rather by LRG IDs! The sequences I do get look about right though. If anybody had any idea what was going on here, I would really appreciate your help. I am using Python3.4 on Mac OS X 10.9.5. Here is my code:

input_file_name = "single_exon_pcgenes_refseq2.txt"

from bioservices import *
import re
e = ensembl.Ensembl()

bm = BioMart()
bm.datasets("ensembl")
bm.add_dataset_to_xml("hsapiens_gene_ensembl")
refseq_ids_file = open(input_file_name)
refseq_ids = refseq_ids_file.readlines()
refseq_ids_file.close()
refseq_ids = [re.sub("\n","",i) for i in refseq_ids]
refseq_ids = ",".join(refseq_ids)
bm.add_filter_to_xml("refseq_mrna", refseq_ids)
bm.add_attribute_to_xml("ensembl_gene_id")
xml_query = bm.get_xml()
ens_ids_raw = bm.query(xml_query)

bm.new_query()
bm.add_dataset_to_xml("hsapiens_gene_ensembl")
bm.add_filter_to_xml("ens_hs_gene", ens_ids_raw)
bm.add_attribute_to_xml("ensembl_gene_id")
bm.add_attribute_to_xml("ensembl_transcript_id")
bm.add_attribute_to_xml("exon_chrom_start")
bm.add_attribute_to_xml("coding")
xml_query_seq = bm.get_xml()
raw_sequences = bm.query(xml_query_seq)

This is one of the four sequences contained in raw_sequences at the moment (I deleted the middle part of the sequence to make it easier to read and added newlines):

ATGGAGGGGATCAGTATATACACTTCAGATAACTACACCGAGGAAATGGGCTCAGGGGACTATGACTCCAT

AACCTCTACAGCAGTGTCCTCATCCTGGCCTTCATCAGTCTGGACCGCTACCTGGCCATCGTCCACGCCA

GAGTCTTCAAGTTTTCACTCCAGCTAA    LRG_51    LRG_51t1    5001;7244

 

*This might seem like a strange way of doing things but I can't ask for the sequences directly in the first step using the RefSeq IDs as input because this would only give me the sequences of the particular transcripts whereas I need the sequences of all the transcripts from each corresponding gene. But I need to filter them first to make sure that each of those genes has at least one single-exon transcript, which is why I need to go through the transcript IDs in the first place (I get those from UCSC).

ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by rs40

Kudos to you for phrasing the question so well!

ADD REPLYlink written 4.2 years ago by RamRS21k
2
gravatar for rs
4.2 years ago by
rs40
United Kingdom
rs40 wrote:

I got it!

Sorry, I should have kept on playing with it for a few more hours before turning to you guys but it was starting to look a bit grim there a while ago.

It turns out I was simply using the wrong filter name. So

bm.add_filter_to_xml("ens_hs_gene", ens_ids_raw)

should have been

bm.add_filter_to_xml("ensembl_gene_id, ens_ids_raw)

Sorry, that was a dumb one!

ADD COMMENTlink written 4.2 years ago by rs40
1

This is like when we find typos right after hitting Send on an email. Happens to all of us :)

ADD REPLYlink written 4.2 years ago by RamRS21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1943 users visited in the last hour