Tutorial: Using the BioMart Perl API for Simple Queries
12.2 years ago

I recently needed to automate a BioMart query and found the Perl API the most convenient tool for the job. I believe BioMart will eventually move away from (or refactor) the Perl API but, until then, it seems the most convenient way to access BioMart programmatically.

This sample script queries the InterPro BioMart for details corresponding to an InterPro accession. A sample Perl snippet was obtained from the BioMart website and used as a starting point. The result is a list of UniProtKB protein accessions and other details for the provided InterPro accession, after several filters have been applied. Almost any query you can construct in the BioMart web interface can be run in this manner: simply click on the 'Perl' button to see how the query lines would need to be changed. The script below should help with some issues that are not explained in the provided code snippets or the (non-existent) documentation for the Perl API: how to handle timeout errors, how to turn result counting on and off, and how to redirect output from STDOUT to a file.

You must have biomart-perl installed for this script to work. It can be downloaded from: http://www.biomart.org/other/install-overview.html. See the section titled "1.2 Downloading biomart-perl" for the CVS commands to run and "1.4 Installing biomart-perl" for installation instructions. A number of dependencies were missing during my installation, but the following code worked without resolving them. Results may vary; ideally you will want root access, or have your system administrator install any missing dependencies.

A registry file must also be provided. It can be obtained from: http://www.biomart.org/biomart/martservice?type=registry. Copy the contents into a file and then delete all entries except those corresponding to INTERPRO and UNIPROT (or whichever database(s) you intend to query). This last step reduces the time required to load the registry.

Note regarding timeout errors: if queries take too long to complete and you receive timeout errors, find the line $ua->timeout(20); in ~/biomart-perl/lib/BioMart/Configuration/URLLocation.pm (or wherever you installed biomart-perl) and increase the value to 180 (i.e., $ua->timeout(180);). I have also written the script below so that it automatically retries queries until they succeed.

#!/usr/bin/perl
use strict;
use warnings;
use lib "$ENV{HOME}/biomart-perl/lib"; #Set this to the path where you installed biomart-perl (Perl does not expand '~')
use BioMart::Initializer;
use BioMart::Query;
use BioMart::QueryRunner;
my $confFile = "$ENV{HOME}/biomart-perl/conf/biomart_Interpro_registry.xml"; #Set this to the path where you saved the registry file (Perl does not expand '~')
my $tempfile = "biomart_query_temp.txt";

#Note: change action to 'clean' if you wish to start a fresh configuration  
#Set to 'cached' if you want to skip configuration step on subsequent runs from the same registry
my $action='cached';
my $initializer = BioMart::Initializer->new('registryFile'=>$confFile, 'action'=>$action);
my $registry = $initializer->getRegistry;

For this example we will query the UniProt BioMart with a single InterPro accession and filter down to only those proteins that are: (1) in "The complete human proteome", see http://www.uniprot.org/faq/48; (2) of Swiss-Prot (Reviewed) status, see http://www.uniprot.org/faq/7; (3) supported by evidence at protein level, see http://www.uniprot.org/docs/pe_criteria. For output, we will retrieve the UniProt accession, UniProt ID, UniProt protein name, and UniProt gene name, along with the protein evidence and entry type.

my $queryterm="IPR000022";
print "\nAttempting UniProt list query for $queryterm\n";
my $query = BioMart::Query->new('registry'=>$registry,'virtualSchemaName'=>'default');
$query->setDataset("uniprot");
$query->addFilter("interpro_id", [$queryterm]);
$query->addFilter("proteome_name", ["Homo sapiens"]);
$query->addFilter("entry_type", ["Swiss-Prot"]);
$query->addFilter("protein_evidence", ["1: Evidence at protein level"]);
$query->addAttribute("accession");
$query->addAttribute("name");
$query->addAttribute("protein_name");
$query->addAttribute("gene_name");
$query->addAttribute("protein_evidence");
$query->addAttribute("entry_type");
my $query_runner = BioMart::QueryRunner->new();
$query_runner->uniqueRowsOnly(1); #to obtain unique rows only

Get a count of the expected results. This is used below to make sure the returned results are complete.

my $count_query_attempt=1;
#Turn on counting
$query->count(1);
my $query_count;
do {
  print "Attempting query count, attempt $count_query_attempt\n";
  $query_runner->execute($query);
  $query_count=$query_runner->getCount();
  sleep(1);
  $count_query_attempt++;
} until ($query_count);

print "$query_count results expected for query\n";
#turn off counting so that full results can be obtained below
$query->count(0);
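
The loop above retries indefinitely, and would never exit if the true count were 0. A bounded variant can be sketched as follows. This is only a minimal sketch: retry_with_limit is a hypothetical helper (not part of biomart-perl), and the "flaky" action here is a stand-in that simulates timeouts rather than a real BioMart call. It traps exceptions with eval, so it would also catch a timeout that is raised as a Perl exception, without editing the library source.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# retry_with_limit is a hypothetical helper, not part of biomart-perl.
# It calls $action up to $max_attempts times, trapping exceptions with eval,
# and returns the first defined result, or undef if every attempt failed.
sub retry_with_limit {
  my ($action, $max_attempts, $delay_seconds) = @_;
  for my $attempt (1 .. $max_attempts) {
    my $result;
    my $ok = eval { $result = $action->(); 1 };
    if (!$ok) {
      warn "Attempt $attempt failed: $@";
    } elsif (defined $result) {
      return $result;   # note: 0 is a defined result, so a zero count is accepted
    }
    sleep($delay_seconds) if $delay_seconds && $attempt < $max_attempts;
  }
  return;
}

# Stand-in for a flaky remote call: dies twice, then returns a count.
my $calls = 0;
my $count = retry_with_limit(sub {
  $calls++;
  die "simulated timeout\n" if $calls < 3;
  return 42;
}, 5, 0);

print defined $count ? "Got $count after $calls calls\n" : "Gave up after $calls calls\n";
```

In the script above, the action would presumably be something like sub { $query_runner->execute($query); $query_runner->getCount() }. Whether biomart-perl reports a timeout by dying or by returning an empty count is an assumption you would need to check for your installation.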

Perform main query of interest. Note that results are directed to STDOUT by default. Therefore we will redirect and store output in a temporary file.

my $query_attempt=1;
my $result_count;
my @results;
do {
  print "Attempting query, attempt $query_attempt\n";
  open (BIOMART_OUT, ">", $tempfile) or die "Can't open $tempfile for writing: $!\n";
  $query_runner->execute($query);
  #$query_runner->printHeader(\*BIOMART_OUT);
  $query_runner->printResults(\*BIOMART_OUT);
  #$query_runner->printFooter(\*BIOMART_OUT);
  close BIOMART_OUT;

  #Read in results and check expected results against count above
  open (BIOMART_IN, "<", $tempfile) or die "Can't open $tempfile for reading: $!\n";
  @results=<BIOMART_IN>;
  close BIOMART_IN;
  $result_count=@results;
  print "$result_count results returned for query\n\n";
  sleep(1);
  $query_attempt++;
} until ($result_count==$query_count);
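
As a possible alternative to the temporary file, Perl can open a filehandle onto a scalar and capture printed output in memory. The sketch below illustrates the technique only: print_rows is a hypothetical stand-in for any routine that prints rows to a filehandle, and the rows are made up. I have not verified that printResults behaves correctly with an in-memory handle, so treat that as an assumption to test.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# print_rows is a hypothetical stand-in for any routine that prints
# tab-delimited rows to a filehandle it is given.
sub print_rows {
  my ($fh) = @_;
  print {$fh} "P12345\tEXAMPLE_HUMAN\n";   # made-up rows for illustration
  print {$fh} "Q67890\tSAMPLE_HUMAN\n";
}

# Open a write handle onto a scalar instead of a file on disk.
my $buffer = '';
open(my $out, '>', \$buffer) or die "Can't open in-memory handle: $!\n";
print_rows($out);
close($out);

# Split the captured text into one element per line.
my @rows = split /\n/, $buffer;
my $row_count = @rows;
print "$row_count rows captured\n";
```

This avoids leaving biomart_query_temp.txt behind, at the cost of holding the full result set in memory.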

Finally, parse the results and print them out in a tab-delimited format.

chomp (@results);
my %UniProtDetails;
foreach my $result (@results){
  my @data=split("\t", $result);
  my $Uniprot_acc=$data[0];
  my $Uniprot_id=$data[1];
  my $Uniprot_protein_name=$data[2]; unless($Uniprot_protein_name){$Uniprot_protein_name="NA";}
  my $Uniprot_gene_name=$data[3]; unless($Uniprot_gene_name){$Uniprot_gene_name="NA";}
  my $Uniprot_evidence=$data[4]; unless($Uniprot_evidence){$Uniprot_evidence="NA";}
  my $Uniprot_status=$data[5]; unless($Uniprot_status){$Uniprot_status="NA";}
  $UniProtDetails{$Uniprot_acc}{Uniprot_id}=$Uniprot_id;
  $UniProtDetails{$Uniprot_acc}{Uniprot_protein_name}=$Uniprot_protein_name;
  $UniProtDetails{$Uniprot_acc}{Uniprot_gene_name}=$Uniprot_gene_name;
  $UniProtDetails{$Uniprot_acc}{Uniprot_evidence}=$Uniprot_evidence;
  $UniProtDetails{$Uniprot_acc}{Uniprot_status}=$Uniprot_status;
}

print "Uniprot_acc\tUniprot_id\tUniprot_protein_name\tUniprot_gene_name\n";
foreach my $uniprot_acc (sort keys %UniProtDetails){
  print "$uniprot_acc\t$UniProtDetails{$uniprot_acc}{'Uniprot_id'}\t$UniProtDetails{$uniprot_acc}{'Uniprot_protein_name'}\t$UniProtDetails{$uniprot_acc}{'Uniprot_gene_name'}\n";
}
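
The parsing step above replaces any false value (including a literal "0") with "NA", and split drops trailing empty fields by default. A slightly stricter version of the same idea might look like the sketch below. parse_row and the column list are hypothetical names chosen for illustration, and the input line is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @columns = qw(accession name protein_name gene_name);

# parse_row is a hypothetical helper: it splits one tab-delimited line into a
# hash keyed by column name, substituting "NA" for missing or empty fields.
sub parse_row {
  my ($line) = @_;
  chomp $line;
  # A limit of -1 keeps trailing empty fields that split would otherwise drop.
  my @fields = split /\t/, $line, -1;
  my %row;
  for my $i (0 .. $#columns) {
    my $value = $fields[$i];
    $row{ $columns[$i] } = (defined $value && length $value) ? $value : 'NA';
  }
  return \%row;
}

# Made-up line with an empty protein name and an empty gene name.
my $row = parse_row("P12345\tEXAMPLE_HUMAN\t\t\n");
print join("\t", map { $row->{$_} } @columns), "\n";
```

Testing with defined and length, rather than plain truthiness, means a field whose value is "0" is kept as-is.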

The fact that you have to login to CVS to access the biomart-perl is painful! Why not just put it on CPAN!

I get the error "Unknown host cvs.sanger.ac.uk." when I try to install from CVS.

If someone is still using this, BioMart Perl is on GitHub: https://github.com/biomart/biomart-perl It is still working. In the example code above, somehow > got replaced by &gt;.

Suppose one lacks the permissions to alter the source directly to increase the timeout value. Is there a way to catch the timeout error, or to increase the timeout value in some other way?

9.2 years ago
jimmy_zeng ▴ 90

Is this the same as the biomaRt package in R?

Actually, there are so many tools implementing the same function that it puzzles me very much.
