How To Programmatically Retrieve A Batch Of FASTA Sequences For A List Of UniProt Accession IDs?
10.8 years ago
Superbest ▴ 130

I have a list of UniProtKB Protein Accessions. I want to fetch the peptide sequence corresponding to each accession in fasta format. Preferably, all the fasta sequences should be combined into a single file.

I would like to do this programmatically, with C#.

UniProt's FAQ explains how to do it. However, the code example they give is in Perl:

use strict;
use warnings;
use LWP::UserAgent;

my $list = $ARGV[0]; # File containing list of UniProt identifiers.

my $base = 'http://www.uniprot.org';
my $tool = 'batch';

my $contact = ''; # Please set your email address here to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';

my $response = $agent->post("$base/$tool/",
                            [ 'file' => [$list],
                              'format' => 'txt',
                            ],
                            'Content_Type' => 'form-data');

while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  $response = $agent->get($response->base);
}

$response->is_success ?
  print $response->content :
  die 'Failed, got ' . $response->status_line .
    ' for ' . $response->request->uri . "\n";

I don't know Perl (and would prefer not to learn it). I can see that the script submits the identifiers with an HTTP POST, but I can't figure out what the POST is supposed to contain. It would be really helpful to have a URL that I could paste into my browser to download the file manually. I tried using Fiddler to intercept the request, but Strawberry Perl manages to bypass Fiddler.

How would I accomplish the same thing this code is doing in C#? What is the correct way to submit the web request?
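
To make the contents of that POST concrete, here is a minimal Python sketch that builds the same kind of multipart/form-data body the Perl script sends: a `file` field carrying the accession list and a `format` field. The function name, the `accessions.txt` filename, and the `fasta` format value are assumptions for illustration (the quoted script sends `txt`), and nothing is sent over the network.

```python
import uuid


def build_multipart_body(accessions, fmt="fasta", boundary=None):
    """Return (content_type, body) for a form upload shaped like the Perl POST.

    Hypothetical helper: the field names 'file' and 'format' come from the
    Perl script above; the filename and default format are assumptions.
    """
    boundary = boundary or uuid.uuid4().hex
    lines = [
        f"--{boundary}",
        'Content-Disposition: form-data; name="file"; filename="accessions.txt"',
        "Content-Type: text/plain",
        "",
        "\n".join(accessions),  # one accession per line, as in the list file
        f"--{boundary}",
        'Content-Disposition: form-data; name="format"',
        "",
        fmt,
        f"--{boundary}--",
        "",
    ]
    body = "\r\n".join(lines).encode("ascii")
    content_type = f"multipart/form-data; boundary={boundary}"
    return content_type, body


if __name__ == "__main__":
    ctype, body = build_multipart_body(["P12345", "P69905"], boundary="XYZ")
    print("Content-Type:", ctype)
    print(body.decode("ascii"))
```

Printing the body shows exactly what goes over the wire. In C#, `HttpClient` with `MultipartFormDataContent` should produce an equivalent request, though that has not been checked against this service.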

uniprot perl webservice • 11k views

I wish to do it programmatically, not manually.


You can do this with a one-liner from the UCSC utilities. Once you have downloaded them, use faSomeRecords:

faSomeRecords in.fa listFile out.fa
10.8 years ago
bilouorama • 0

Another way is simply to fetch each entry's page, e.g. http://www.uniprot.org/uniprot/P12345.fasta, with libcurl or another HTTP library. It's slow because each file is requested separately.

You can also fetch everything at once with wget or curl using the batch retrieval service: http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=uniprotkb&id=P14060+P26439+P27364+Q62878+Q61767+Q61694+P14893+P22071+P22072+P24815+Q91997+Q67477+Q91381+Q98318+P21097+P26670&format=fasta&style=raw&Retrieve=Retrieve

With this command I get all the sequences in FASTA format :-)

I don't know whether C# has a library like libcurl for C, but I hope so.
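
As a rough illustration of the dbfetch approach, the Python sketch below joins a list of accessions into one URL of the same shape as the one above and downloads the result. The function names are hypothetical, the optional `Retrieve=Retrieve` parameter is left out on the assumption that it is not required, and no limit on the number of IDs per request is assumed or checked.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the URL quoted in this answer.
DBFETCH = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(accessions, db="uniprotkb", fmt="fasta"):
    """Build one dbfetch URL for all accessions (hypothetical helper)."""
    query = urlencode({
        "db": db,
        "id": " ".join(accessions),  # urlencode turns the spaces into '+'
        "format": fmt,
        "style": "raw",
    })
    return f"{DBFETCH}?{query}"


def fetch_fasta(accessions, out_path):
    """Download all records into a single FASTA file (requires network)."""
    with urlopen(dbfetch_url(accessions)) as response:
        data = response.read()
    with open(out_path, "wb") as handle:
        handle.write(data)


if __name__ == "__main__":
    print(dbfetch_url(["P14060", "P26439", "P27364"]))
```

The printed URL can also be pasted into a browser, which covers the "manual download" case from the question.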


This code, taken from the "Mapping database identifiers" section, maps identifiers from one database to another. I want to download the fasta sequences corresponding to the identifiers.


You're right, I was confused. I have changed my answer.

10.8 years ago
Kenosis ★ 1.3k

Perhaps the following will be helpful:

use strict;
use warnings;
use LWP::Simple;

@ARGV == 2 or die "Usage: perl $0 accessions.txt outFile.fasta\n";

my $dashes  = '-' x 25 . "\n";
my $outFile = pop;
my %list    = map { chomp; $_ => 1 } <>;

my $accessions = join '+OR+', keys %list;
my $fastaRecs =
  get "http://www.uniprot.org/uniprot/?query=$accessions&force=yes&format=fasta"
  or die "No records returned.\n";

open my $fh, '>', $outFile or die $!;

for ( split />/, $fastaRecs ) {
    if ( /\|(.+?)\|/ and exists $list{$1} ) {
        print $fh ">$_";
        delete $list{$1};
    }
}

close $fh;

print $dashes;

if ( !keys %list ) {
    print "All accessions retrieved.\n";
}
else {
    print "Accessions not retrieved:\n";
    print "$_\n" for keys %list;
}

print $dashes;

Usage: perl script.pl accessions.txt outFile.fasta

The script builds an "OR" query from the accessions, writes only the requested records to the output file, and tracks which accessions were retrieved. At the end it reports either that all accessions were retrieved or which ones are missing.
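
For readers who do not use Perl, the same logic can be sketched in Python. This is an approximate translation, not a tested replacement: `build_or_query` and `filter_fasta` are hypothetical names, and the download step is omitted so that the filtering and tracking can be tried offline.

```python
import re


def build_or_query(accessions):
    """Join unique accessions with '+OR+', as the Perl script does."""
    return "+OR+".join(sorted(set(accessions)))


def filter_fasta(fasta_text, wanted):
    """Keep only records whose accession was requested.

    Returns (filtered_fasta, missing), where missing is the set of
    requested accessions that never appeared in fasta_text.
    """
    missing = set(wanted)
    kept = []
    for record in fasta_text.split(">"):
        # The accession sits between the first two pipes: >sp|P12345|NAME ...
        match = re.search(r"\|(.+?)\|", record)
        if match and match.group(1) in missing:
            kept.append(">" + record)
            missing.discard(match.group(1))
    return "".join(kept), missing


if __name__ == "__main__":
    sample = (
        ">sp|P12345|AATM_RABIT example\nMALLHSARVL\n"
        ">sp|Q00000|OTHER_X not requested\nAAAA\n"
    )
    fasta, missing = filter_fasta(sample, ["P12345", "A2BC19"])
    print(build_or_query(["P12345", "A2BC19"]))
    print(fasta, end="")
    print("Accessions not retrieved:", sorted(missing))
```

Removing each accession from the set once it is found mirrors the `delete $list{$1}` step, so a duplicate record in the response is written only once.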

For example, when accessions.txt contained:

A2BC19
P12345
P69905

the output file contains:

>sp|P12345|AATM_RABIT Aspartate aminotransferase, mitochondrial OS=Oryctolagus cuniculus GN=GOT2 PE=1 SV=2
MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKGLDKEYLPIGGLAEFCRASAELALGENSEV
VKSGRFVTVQTISGTGALRIGASFLQRFFKFSRDVFLPKPSWGNHTPIFRDAGMQLQSYR
YYDPKTCGFDFTGALEDISKIPEQSVLLLHACAHNPTGVDPRPEQWKEIATVVKKRNLFA
FFDMAYQGFASGDGDKDAWAVRHFIEQGINVCLCQSYAKNMGLYGERVGAFTVICKDADE
AKRVESQLKILIRPMYSNPPIHGARIASTILTSPDLRKQWLQEVKGMADRIIGMRTQLVS
NLKKEGSTHSWQHITDQIGMFCFTGLKPEQVERLTKEFSIYMTKDGRISVAGVTSGNVGY
LAHAIHQVTK
>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
AVHASLDKFLASVSTVLTSKYR
>tr|A2BC19|A2BC19_HELPX GTPase (Fragment) OS=Helicobacter pylori GN=yphC PE=4 SV=1
NISHKTLKTIAILGQPNVGKSSLFNRLARERIAITSDFAGTTRDINKRKIALNGHEVELL
DTGGMAKDALLSKEIKALNLKAAQMSDLILYVVDGKSIPSDEDIKLFREVFKTNPNCFLV
INKIDNDKEKERAYAFSSFGAPKSFNISVSHNRGISALIDAVLNALNLNQ

Although you may also just enjoy the "Retrieve" tab at uniprot.org.


As I posted in the original question, I already have a Perl script to accomplish this task. I am trying to understand how to do it in other languages, such as C#.

7.3 years ago
Natasha ▴ 40

Hello Everyone!

I would like to convert the Python script in this link so that it does the same task as the Perl script posted here, i.e. I have a text file listing UniProt IDs and I wish to obtain the protein sequences as a FASTA file using a Python script. After that, I will use each sequence to calculate its molecular weight. I already have a Python script for calculating molecular weight.

Could someone help me translate the Perl script into Python?

Thanks a lot for your time and kind consideration.

Deepa
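
One possible starting point for the Python side, assuming the FASTA text has already been downloaded by one of the methods in the answers above: read the IDs from the text file, then turn the FASTA text into a dictionary of accession to sequence that a molecular weight function can consume. Both function names are hypothetical, and the header parsing assumes UniProt-style headers such as `>sp|P69905|HBA_HUMAN ...`.

```python
def read_accessions(lines):
    """Return unique, non-empty accessions in their original order."""
    seen = []
    for line in lines:
        accession = line.strip()
        if accession and accession not in seen:
            seen.append(accession)
    return seen


def parse_fasta(text):
    """Map each accession to its sequence with line breaks removed."""
    sequences = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            fields = header.split("|")
            if len(fields) >= 3:
                current = fields[1]  # UniProt style: db|ACCESSION|NAME
            else:
                # Fall back to the first word of a non-UniProt header.
                current = header.split()[0] if header else header
            sequences[current] = ""
        elif current is not None:
            sequences[current] += line
    return sequences


if __name__ == "__main__":
    fasta = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha\nMVLSPADKTN\nVKAAWGKVGA\n"
    for accession, sequence in parse_fasta(fasta).items():
        print(accession, len(sequence))
```

With a real list file, `read_accessions(open("accessions.txt"))` would supply the IDs, and each value returned by `parse_fasta` can be passed to the existing molecular weight script.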
