Question

How To Programmatically Retrieve A Batch Of Fasta Sequences For A List Of Uniprot Accession Ids?

10.3 years ago

Superbest ▴ 130

I have a list of UniProtKB Protein Accessions. I want to fetch the peptide sequence corresponding to each accession in fasta format. Preferably, all the fasta sequences should be combined into a single file.

I would like to do this programmatically, with C#.

UniProt's FAQ explains how to do it. However, the code example they give is Perl:

use strict;
use warnings;
use LWP::UserAgent;

my $list = $ARGV[0]; # File containing list of UniProt identifiers.

my $base = 'http://www.uniprot.org';
my $tool = 'batch';

my $contact = ''; # Please set your email address here to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';

my $response = $agent->post("$base/$tool/",
                            [ 'file' => [$list],
                              'format' => 'txt',
                            ],
                            'Content_Type' => 'form-data');

while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  $response = $agent->get($response->base);
}

$response->is_success ?
  print $response->content :
  die 'Failed, got ' . $response->status_line .
    ' for ' . $response->request->uri . "\n";

I don't know Perl (and would prefer not having to learn it). I can see that it submits the required identifiers with an HTTP POST, but I can't figure out what the HTTP POST is supposed to contain (it would be really helpful if I had a URL that I could paste into my browser to manually download the file). I tried using Fiddler to intercept the request, but Strawberry Perl manages to circumvent Fiddler.

How would I accomplish the same thing this code is doing in C#? What is the correct way to submit the web request?

uniprot perl webservice • 11k views

ADD COMMENT • link updated 6.8 years ago by Natasha ▴ 40 • written 10.3 years ago by Superbest ▴ 130

how about using http://www.uniprot.org/help/batch ?

ADD REPLY • link 10.3 years ago by Pierre Lindenbaum 161k

I wish to do it programmatically, not manually.

ADD REPLY • link 10.3 years ago by Superbest ▴ 130

You can do this with a one-liner using the UCSC utilities. Once you have downloaded them, use faSomeRecords to extract the listed records from a FASTA file:

faSomeRecords in.fa listFile out.fa

ADD REPLY • link 6.8 years ago by Chirag Parsania ★ 2.0k

score 0 · Answer 1 · 2014-01-08

10.3 years ago

bilouorama • 0

Another way is simply to fetch each page, e.g. http://www.uniprot.org/uniprot/P12345.fasta, with libcurl or another library. It's slow because each file is requested separately.

You can also get the sequences with wget or curl using the batch retrieve service: http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=uniprotkb&id=P14060+P26439+P27364+Q62878+Q61767+Q61694+P14893+P22071+P22072+P24815+Q91997+Q67477+Q91381+Q98318+P21097+P26670&format=fasta&style=raw&Retrieve=Retrieve

With this command I get all the sequences in fasta format :-)

I don't know whether a C# library like libcurl for C exists, but I hope so.
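
For anyone who prefers a script to a hand-built URL, here is a minimal Python sketch of the same dbfetch request. The endpoint and parameters are copied from the URL above; the function names (build_dbfetch_url, fetch_fasta) are made up for illustration, and I am assuming the service still accepts this form of request.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the dbfetch URL quoted above (assumed to still be valid).
DBFETCH_URL = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def build_dbfetch_url(accessions, db="uniprotkb"):
    """Build a dbfetch URL asking for all accessions as raw FASTA."""
    params = {
        "db": db,
        "id": " ".join(accessions),  # urlencode turns the spaces into '+'
        "format": "fasta",
        "style": "raw",
        "Retrieve": "Retrieve",
    }
    return DBFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accessions, timeout=60):
    """Download the FASTA records for the accessions as one string."""
    with urlopen(build_dbfetch_url(accessions), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_fasta(["P14060", "P26439"]) and writing the result to a file should give a single combined FASTA file. For a long list it is probably safer to send the accessions in smaller chunks, since the URL length is limited.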

ADD COMMENT • link 10.3 years ago by bilouorama • 0

This code, taken from the "Mapping database identifiers" section, maps identifiers from one database to another. I want to download the fasta sequences corresponding to the identifiers.

ADD REPLY • link 10.3 years ago by Superbest ▴ 130

You're right... I was confused... I have changed my answer.

ADD REPLY • link 10.3 years ago by bilouorama • 0

score 0 · Answer 2 · 2014-01-08

Perhaps the following will be helpful:

use strict;
use warnings;
use LWP::Simple;

@ARGV == 2 or die "Usage: perl $0 accessions.txt outFile.fasta\n";

my $dashes  = '-' x 25 . "\n";
my $outFile = pop;
my %list    = map { chomp; $_ => 1 } <>;

my $accessions = join '+OR+', keys %list;
my $fastaRecs =
  get "http://www.uniprot.org/uniprot/?query=$accessions&force=yes&format=fasta"
  or die "No records returned.\n";

open my $fh, '>', $outFile or die $!;

for ( split />/, $fastaRecs ) {
    if ( /\|(.+?)\|/ and exists $list{$1} ) {
        print $fh ">$_";
        delete $list{$1};
    }
}

close $fh;

print $dashes;

if ( !keys %list ) {
    print "All accessions retrieved.\n";
}
else {
    print "Accessions not retrieved:\n";
    print "$_\n" for keys %list;
}

print $dashes;

Usage: perl script.pl accessions.txt outFile.fasta

The script builds an "OR" query from the accessions, writes only the requested records to the output file, and tracks which accessions were retrieved. It then prints a summary: either that all accessions were retrieved, or a list of the missing ones.
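
For readers who would rather not use Perl, the same logic can be sketched in Python roughly as follows. This is only a sketch: the function names (build_query_url, filter_fasta) are hypothetical, and the query URL is copied from the Perl script above, so it may need adjusting if UniProt changes its service.

```python
import re


def build_query_url(accessions):
    """Build the same "OR" query URL that the Perl script requests."""
    query = "+OR+".join(accessions)
    return ("http://www.uniprot.org/uniprot/?query=" + query
            + "&force=yes&format=fasta")


def filter_fasta(fasta_text, wanted):
    """Keep only the requested records.

    Returns (kept_fasta_text, missing_accessions).
    """
    missing = set(wanted)
    kept = []
    for record in fasta_text.split(">"):
        # The accession sits between the first two '|' in the header line.
        match = re.search(r"\|(.+?)\|", record)
        if match and match.group(1) in missing:
            kept.append(">" + record)
            missing.discard(match.group(1))
    return "".join(kept), missing
```

The text to filter would come from downloading build_query_url(accessions), for example with urllib.request.urlopen. Any accessions left in the returned set are the ones that were not retrieved.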

For example, when accessions.txt contained:

A2BC19
P12345
P69905

the output file contains:

>sp|P12345|AATM_RABIT Aspartate aminotransferase, mitochondrial OS=Oryctolagus cuniculus GN=GOT2 PE=1 SV=2
MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKGLDKEYLPIGGLAEFCRASAELALGENSEV
VKSGRFVTVQTISGTGALRIGASFLQRFFKFSRDVFLPKPSWGNHTPIFRDAGMQLQSYR
YYDPKTCGFDFTGALEDISKIPEQSVLLLHACAHNPTGVDPRPEQWKEIATVVKKRNLFA
FFDMAYQGFASGDGDKDAWAVRHFIEQGINVCLCQSYAKNMGLYGERVGAFTVICKDADE
AKRVESQLKILIRPMYSNPPIHGARIASTILTSPDLRKQWLQEVKGMADRIIGMRTQLVS
NLKKEGSTHSWQHITDQIGMFCFTGLKPEQVERLTKEFSIYMTKDGRISVAGVTSGNVGY
LAHAIHQVTK
>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
AVHASLDKFLASVSTVLTSKYR
>tr|A2BC19|A2BC19_HELPX GTPase (Fragment) OS=Helicobacter pylori GN=yphC PE=4 SV=1
NISHKTLKTIAILGQPNVGKSSLFNRLARERIAITSDFAGTTRDINKRKIALNGHEVELL
DTGGMAKDALLSKEIKALNLKAAQMSDLILYVVDGKSIPSDEDIKLFREVFKTNPNCFLV
INKIDNDKEKERAYAFSSFGAPKSFNISVSHNRGISALIDAVLNALNLNQ

Although you may also just enjoy the "Retrieve" tab at uniprot.org.

score 0 · Answer 3 · 2017-07-09

Hello Everyone!

I would like to convert the Python script in this link to do the same task that the Perl script posted here does, i.e. I have a txt file listing UniProt IDs and I wish to obtain the protein sequences as a FASTA file using a Python script. After that, I will use each sequence to calculate its molecular weight. I already have a Python script for calculating the molecular weight.

Could someone help me with converting the Perl script to Python?

Thanks a lot for your time and kind consideration.

Deepa