Question

Automating Database Searches

2

12.8 years ago

Bnfoguy ▴ 70

I am trying to write a program that automates finding the exons and introns of genes, so that I can look for genetic variation in those genes. What I want is a script that accesses the BioMart database and performs this task for me. Does anyone have an idea how this could be done?

Thank you,

Bnfoguy

biomart search ensembl • 3.8k views

updated 12.7 years ago by Akk ▴ 210 • written 12.8 years ago by Bnfoguy ▴ 70

1

magic ? :-)

12.8 years ago by Pierre Lindenbaum 161k

score 3 · Answer 1 · 2011-07-15

I cannot help you with the real juice, but you can access BioMart in several programmatic ways.

A very simple example of how to access BioMart 0.7 can be found here:

http://joachimbaran.wordpress.com/2011/06/17/bioknacks-pubmed2ensembl-query-wrapper/
https://github.com/joejimbo/bioknack/blob/master/bk_pubmed2ensembl.rb

There I use Darren Oakley's Ruby API, which I find very easy and straightforward to use.

For BioMart 0.8, you can either have a look at the docs (http://www.biomart.org/rc6_documentation.pdf) for the various methods to access the new BioMart; or fiddle around with the basic SPARQL-interface that was introduced in BioMart 0.8rc6.

For example, there are SPARQL-endpoints for the new Central Portal marts (http://central.biomart.org), such as the Ensembl Gene mart:

http://central.biomart.org/martwizard/#!/Genome?mart=gene_ensembl_config_4

When you get to the results page of a query, there is a "SPARQL" button that shows you the equivalent SPARQL-query that you can use to programmatically obtain the results via the SPARQL-endpoint. I have attached an example query below that I clicked together on the second link (gene_ensembl_config_4). I just added a "LIMIT 5" at the end manually. The endpoint for submitting the query is:

http://central.biomart.org/martsemantics/gene_ensembl_config_4/SPARQLXML/get/?query=*urlencoded SPARQL-query*

A list of queryable attributes for creating manual queries can be obtained from the ontology of the Ensembl mart. If you use a tool such as Protégé (http://protege.stanford.edu/) then you can load the ontology via the URI:

http://central.biomart.org/martsemantics/gene_ensembl_config_4/ontology

Using SPARQL to retrieve internal nodes or other meta-data is not possible (yet) with our implementation. For example, querying "select ?x ?y ?z where {?x ?y ?z}" to get all triples will not work. You need to consider the ontology for figuring out such things.

Example Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

PREFIX accesspoint: <http://central.biomart.org/martsemantics/gene_ensembl_config_4/ontology#>
PREFIX class: <biomart://central.biomart.org/martsemantics/gene_ensembl_config_4/ontology/class#>
PREFIX dataset: <biomart://central.biomart.org/martsemantics/gene_ensembl_config_4/ontology/dataset#>
PREFIX attribute: <biomart://central.biomart.org/martsemantics/gene_ensembl_config_4/ontology/attribute#>

SELECT ?a0 ?a1 ?a2
FROM dataset:hsapiens_gene_ensembl
WHERE {
  ?mart attribute:biotype "protein_coding" .
  ?mart attribute:atlas_celltype "germ cell" .
  ?mart attribute:ensembl_gene_id ?a0 .
  ?mart attribute:ensembl_peptide_id ?a1 .
  ?mart attribute:ensembl_exon_id ?a2
}
LIMIT 5
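
To make the "urlencoded SPARQL-query" part of the endpoint concrete, here is a minimal Python sketch that only assembles the request URL and sends nothing. The base URL and mart name are the ones quoted above; the function name `sparql_url` and the shortened query (PREFIX lines omitted) are made up for illustration, and the service itself may no longer be reachable.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL and mart name as quoted in this answer (assumed still valid).
BASE = "http://central.biomart.org/martsemantics"
MART = "gene_ensembl_config_4"


def sparql_url(query: str, mart: str = MART) -> str:
    """Return the SPARQLXML endpoint URL with the query URL-encoded."""
    return f"{BASE}/{mart}/SPARQLXML/get/?{urlencode({'query': query})}"


# Shortened stand-in for the example query; the PREFIX lines are left out
# here, so this exact text is illustrative rather than a complete query.
QUERY = """SELECT ?a0
FROM dataset:hsapiens_gene_ensembl
WHERE { ?mart attribute:ensembl_gene_id ?a0 }
LIMIT 5"""

url = sparql_url(QUERY)

# Decoding the query string gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting `url` can be handed to any HTTP client; the full example query, including its PREFIX lines, would be passed in the same way.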

score 3 · Answer 2 · 2011-07-19

Hello Chris,

As already mentioned in the other answers you can indeed access Ensembl BioMart programmatically.

However, in your question you mention that you want to retrieve introns. Unfortunately, we don't have information on these in BioMart. In contrast to what people sometimes think, the BioMart databases only contain a subset of the data that are present in the Ensembl databases and introns are not part of this subset.

So, for your purpose, I think using the Ensembl Core databases instead of the BioMart databases is the best option.

You can find more information on the Core Perl API (installation, documentation, tutorial) here:

http://www.ensembl.org/info/docs/api/core/index.html

We also regularly organise (free) 3-day API workshops in Hinxton and Cambridge (the next one is scheduled for 5-7 Sep in Hinxton).

Hope this helps.

Bert

score 2 · Answer 3 · 2011-07-15

I posted a document on how to automate searches through Ensembl using BioMart. You can either install the BioMart Perl API or simply use a script that calls the web service.

You can find an interesting answer here

Here is a small perl script that you can adapt for your needs :

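# An example script demonstrating the use of the BioMart web service.
# Usage: perl getFromEnsembl.pl Query.xml
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Headers;

my $usage = "\nUsage: perl getFromEnsembl.pl Query.xml\n\n";
my $file  = $ARGV[0];
die $usage unless defined $file;
open( my $fh, '<', $file ) or die $usage;

# Read the whole XML query file into one string.
my $xml = '';
while (<$fh>) {
    $xml .= $_;
}
close($fh);

my $path    = "http://www.biomart.org/biomart/martservice?";
my $request = HTTP::Request->new( "POST", $path, HTTP::Headers->new(),
                                  'query=' . $xml . "\n" );
my $ua = LWP::UserAgent->new;

# Print the result as it arrives, in chunks of roughly 1000 bytes.
my $response = $ua->request( $request,
                             sub {
                                 my ($data) = @_;
                                 print $data;
                             },
                             1000 );

# Check the status on the returned response object rather than inside the callback.
unless ( $response->is_success ) {
    warn( "Problems with the web server: " . $response->status_line );
}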

The Query.xml file is the XML you get from the BioMart website when you design your query through its interface.
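
For readers who would rather generate the query file than export it from the website, here is a rough Python sketch of what such a file typically looks like. The element layout follows the usual BioMart query XML; the helper name `build_query` is hypothetical, the attribute names are borrowed from the SPARQL example in the other answer, and whether `biotype` exists as a filter under that name in a given mart is an assumption.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, attributes, filters=None):
    """Return a BioMart-style query XML string for a single dataset."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",          # tab-separated output
        "header": "0",               # no header row
        "uniqueRows": "1",           # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    # Filters restrict the rows, attributes choose the columns.
    for name, value in (filters or {}).items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE Query>\n' + body


# Names below are assumptions for illustration, not checked against a live mart.
xml_text = build_query(
    "hsapiens_gene_ensembl",
    ["ensembl_gene_id", "ensembl_exon_id"],
    {"biotype": "protein_coding"},
)
```

Writing `xml_text` to a file gives something that can be passed to the Perl script above in place of a hand-exported Query.xml.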

Hope this helps

score 0 · Answer 4 · 2011-08-08

Hi, and sorry for posting this a bit late.

Why do you want to use BioMart for this? It is usually a bad idea to script against BioMart, since it can potentially put a lot of stress on our servers (unless you know what you're doing).

Here's a solution (for one single gene) which uses the Ensembl Perl API instead:

use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db( '-host' => 'ensembldb.ensembl.org',
                                  '-port' => '5306',
                                  '-user' => 'anonymous',
                                  '-db_version' => '63' );

my $ga = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

my $gene = $ga->fetch_by_stable_id('ENSG00000139597');

foreach my $transcript ( @{ $gene->get_all_Transcripts() } ) {
  printf( "%s: [%d-%d] (%+d)\n",
          $transcript->stable_id(), $transcript->start(),
          $transcript->end(),       $transcript->strand() );

  my @introns = @{ $transcript->get_all_Introns() };

  foreach my $exon ( @{ $transcript->get_all_Exons() } ) {
    printf( "\t%s: [%d-%d]\n",
            $exon->stable_id(), $exon->start(), $exon->end() );

    my $intron = shift(@introns);

    if ( defined($intron) ) {
      printf( "\tIntron: [%d-%d]\n", $intron->start(), $intron->end() );
    }
  }
}

This could easily be extended to get, e.g., all genes on a chromosome or for a species, or to get the sequence for the exons and/or the introns, etc.
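
The script prints an intron after each exon because, within a transcript, an intron is the gap between two neighbouring exons. As a small illustration of that relationship (not a replacement for the API call), here is a Python sketch with made-up coordinates. It assumes 1-based, inclusive genomic coordinates with start <= end, and it orders exons by genomic position rather than by transcription order; the function name `introns_between` is hypothetical.

```python
def introns_between(exons):
    """Given (start, end) exon pairs, return the (start, end) gaps between them."""
    ordered = sorted(exons)  # genomic order, by start coordinate
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only report a gap if the two exons do not touch or overlap.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Made-up exon coordinates for one transcript, deliberately unsorted.
exons = [(500, 600), (100, 200), (300, 400)]
result = introns_between(exons)
```

With these three exons, `result` holds two introns, one for each gap between neighbouring exons.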

Cheers, K/A