Question: How To Programmatically Retrieve A Batch Of FASTA Sequences For A List Of UniProt Accession IDs?
Superbest wrote (4.3 years ago, United States):

I have a list of UniProtKB Protein Accessions. I want to fetch the peptide sequence corresponding to each accession in fasta format. Preferably, all the fasta sequences should be combined into a single file.

I would like to do this programmatically, with C#.

UniProt's FAQ explains how to do it. However, the code example they give is Perl:

use strict;
use warnings;
use LWP::UserAgent;

my $list = $ARGV[0]; # File containing list of UniProt identifiers.

my $base = 'http://www.uniprot.org';
my $tool = 'batch';

my $contact = ''; # Please set your email address here to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';

my $response = $agent->post("$base/$tool/",
                            [ 'file' => [$list],
                              'format' => 'txt',
                            ],
                            'Content_Type' => 'form-data');

while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  $response = $agent->get($response->base);
}

$response->is_success ?
  print $response->content :
  die 'Failed, got ' . $response->status_line .
    ' for ' . $response->request->uri . "\n";

I don't know Perl (and would prefer not to have to learn it). I can see that it submits the required identifiers with an HTTP POST, but I can't figure out what the HTTP POST is supposed to contain (it would be really helpful if I had a URL that I could paste into my browser to download the file manually). I tried using Fiddler to intercept the request, but Strawberry Perl manages to bypass Fiddler.

How would I accomplish the same thing this code is doing in C#? What is the correct way to submit the web request?
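
For readers trying to see what the POST contains: the Perl call sends a multipart/form-data form with two parts, an uploaded 'file' (the list of identifiers) and a 'format' field, then polls while the server answers with a Retry-After header. Below is a minimal Python sketch of that request, not a tested client. The names BATCH_URL, build_multipart and submit_batch are made up for illustration, the URL is simply the one from the Perl script and may no longer respond as described, fmt="fasta" is an assumption (the Perl sample sends 'txt'), and Retry-After is assumed to be a number of seconds.

```python
import time
import urllib.request
import uuid

# Endpoint copied from the Perl script above; it may have changed since.
BATCH_URL = "http://www.uniprot.org/batch/"


def build_multipart(accessions, fmt="fasta"):
    """Return (body, content_type) for a multipart/form-data POST.

    Mirrors the two form fields of the Perl call: an uploaded 'file'
    with one accession per line, and a plain 'format' field.
    """
    boundary = uuid.uuid4().hex
    file_text = "\n".join(accessions) + "\n"
    parts = [
        f"--{boundary}",
        'Content-Disposition: form-data; name="file"; filename="list.txt"',
        "Content-Type: text/plain",
        "",
        file_text,
        f"--{boundary}",
        'Content-Disposition: form-data; name="format"',
        "",
        fmt,
        f"--{boundary}--",
        "",
    ]
    body = "\r\n".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def submit_batch(accessions, fmt="fasta", contact=""):
    """POST the list, then poll while the server sends Retry-After."""
    body, content_type = build_multipart(accessions, fmt)
    request = urllib.request.Request(
        BATCH_URL,
        data=body,
        headers={
            "Content-Type": content_type,
            "User-Agent": f"python-urllib {contact}".strip(),
        },
    )
    response = urllib.request.urlopen(request)
    # Assumption: Retry-After holds an integer number of seconds.
    while response.headers.get("Retry-After"):
        time.sleep(int(response.headers["Retry-After"]))
        response = urllib.request.urlopen(response.geturl())
    return response.read().decode("utf-8")
```

Because the request body is an uploaded form rather than a query string, there is no equivalent URL to paste into a browser for this particular call. The same two form fields should carry over to C# or any other HTTP client.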

Tags: perl, webservice, uniprot • 6.3k views
modified 9 months ago by deepamahm.iisc • written 4.3 years ago by Superbest

how about using http://www.uniprot.org/help/batch ?

written 4.3 years ago by Pierre Lindenbaum

I wish to do it programmatically, not manually.

written 4.3 years ago by Superbest

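You can do this with a one-liner using faSomeRecords from the UCSC utilities. Once you have downloaded the utilities, run the following, where in.fa is a FASTA file you already have locally and listFile names the records to keep: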

faSomeRecords in.fa listFile out.fa
written 9 months ago by Chirag Parsania

bilouorama wrote (4.3 years ago):

Another way is simply to fetch each page, for example http://www.uniprot.org/uniprot/P12345.fasta, with libcurl or another HTTP library. It is slow because each sequence is requested separately.
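
As a rough illustration of the one-request-per-accession approach, here is a short Python sketch using only the standard library. The helper names fasta_url and fetch_each are invented for this example, and the URL pattern is just the one quoted above, so it may need adjusting if the site has changed.

```python
import urllib.request

# URL pattern quoted in the answer above.
BASE = "http://www.uniprot.org/uniprot"


def fasta_url(accession):
    """Build the per-entry FASTA URL for one accession."""
    return f"{BASE}/{accession.strip()}.fasta"


def fetch_each(accessions, out_path):
    """Download every entry in turn and append it to a single file."""
    with open(out_path, "w") as out:
        for accession in accessions:
            with urllib.request.urlopen(fasta_url(accession)) as response:
                out.write(response.read().decode("utf-8"))
```

Calling fetch_each(["P12345", "P69905"], "out.fasta") would issue two requests, one after the other, which is why this gets slow for long lists.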

You can also fetch the sequences with wget or curl through the dbfetch batch retrieval service: http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=uniprotkb&id=P14060+P26439+P27364+Q62878+Q61767+Q61694+P14893+P22071+P22072+P24815+Q91997+Q67477+Q91381+Q98318+P21097+P26670&format=fasta&style=raw&Retrieve=Retrieve

With this command I get all the sequences in fasta format :-)
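
A possible way to build such dbfetch URLs from a list, sketched in Python. The function name dbfetch_urls and the chunk_size default of 100 are assumptions made for this example (the real per-request limit is not stated here); the query parameters are copied from the URL above.

```python
from urllib.parse import urlencode

# Service address taken from the URL above.
DBFETCH = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_urls(accessions, chunk_size=100):
    """Yield one dbfetch URL per chunk of accessions.

    chunk_size is a guess; lower it if the server rejects long requests.
    """
    ids = [a.strip() for a in accessions if a.strip()]
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        query = urlencode({
            "db": "uniprotkb",
            "id": " ".join(chunk),  # urlencode turns the spaces into '+'
            "format": "fasta",
            "style": "raw",
            "Retrieve": "Retrieve",
        })
        yield f"{DBFETCH}?{query}"
```

Each yielded URL is a plain GET, so it can be pasted into a browser or passed to any HTTP client, and the responses can be concatenated into one FASTA file.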

I don't know whether C# has a library like libcurl for C, but I hope so.

modified 4.3 years ago • written 4.3 years ago by bilouorama

This code, taken from the "Mapping database identifiers" section, maps identifiers from one database to another. I want to download the fasta sequences corresponding to the identifiers.

written 4.3 years ago by Superbest

You're right... I was confused... I have changed my answer.

written 4.3 years ago by bilouorama

Kenosis wrote (4.3 years ago):

Perhaps the following will be helpful:

use strict;
use warnings;
use LWP::Simple;

@ARGV == 2 or die "Usage: perl $0 accessions.txt outFile.fasta\n";

my $dashes  = '-' x 25 . "\n";
my $outFile = pop;
my %list    = map { chomp; $_ => 1 } <>;

my $accessions = join '+OR+', keys %list;
my $fastaRecs =
  get "http://www.uniprot.org/uniprot/?query=$accessions&force=yes&format=fasta"
  or die "No records returned.\n";

open my $fh, '>', $outFile or die $!;

for ( split />/, $fastaRecs ) {
    if ( /\|(.+?)\|/ and exists $list{$1} ) {
        print $fh ">$_";
        delete $list{$1};
    }
}

close $fh;

print $dashes;

if ( !keys %list ) {
    print "All accessions retrieved.\n";
}
else {
    print "Accessions not retrieved:\n";
    print "$_\n" for keys %list;
}

print $dashes;

Usage: perl script.pl accessions.txt outFile.fasta

The script builds an "OR" query from the accessions, writes only the requested records to the output file, and tracks which accessions were retrieved. It then reports either that all accessions were retrieved or which ones are missing.

For example, when accessions.txt contained:

A2BC19
P12345
P69905

the output file contains:

>sp|P12345|AATM_RABIT Aspartate aminotransferase, mitochondrial OS=Oryctolagus cuniculus GN=GOT2 PE=1 SV=2
MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKGLDKEYLPIGGLAEFCRASAELALGENSEV
VKSGRFVTVQTISGTGALRIGASFLQRFFKFSRDVFLPKPSWGNHTPIFRDAGMQLQSYR
YYDPKTCGFDFTGALEDISKIPEQSVLLLHACAHNPTGVDPRPEQWKEIATVVKKRNLFA
FFDMAYQGFASGDGDKDAWAVRHFIEQGINVCLCQSYAKNMGLYGERVGAFTVICKDADE
AKRVESQLKILIRPMYSNPPIHGARIASTILTSPDLRKQWLQEVKGMADRIIGMRTQLVS
NLKKEGSTHSWQHITDQIGMFCFTGLKPEQVERLTKEFSIYMTKDGRISVAGVTSGNVGY
LAHAIHQVTK
>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
AVHASLDKFLASVSTVLTSKYR
>tr|A2BC19|A2BC19_HELPX GTPase (Fragment) OS=Helicobacter pylori GN=yphC PE=4 SV=1
NISHKTLKTIAILGQPNVGKSSLFNRLARERIAITSDFAGTTRDINKRKIALNGHEVELL
DTGGMAKDALLSKEIKALNLKAAQMSDLILYVVDGKSIPSDEDIKLFREVFKTNPNCFLV
INKIDNDKEKERAYAFSSFGAPKSFNISVSHNRGISALIDAVLNALNLNQ

Although you may also just enjoy the "Retrieve" tab at uniprot.org.
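
For those who would rather not use Perl, the query-building and filtering steps of the script above could look roughly like this in Python. This is an untested-against-the-server sketch: build_query_url and filter_records are names invented here, and the query URL is copied from the Perl script, so it may no longer be accepted by uniprot.org.

```python
import re
from urllib.parse import quote

# Query endpoint copied from the Perl script above.
QUERY_URL = "http://www.uniprot.org/uniprot/"


def build_query_url(accessions):
    """Join the accessions into a single '+OR+' query URL."""
    query = "+OR+".join(quote(a) for a in accessions)
    return f"{QUERY_URL}?query={query}&force=yes&format=fasta"


def filter_records(fasta_text, wanted):
    """Keep only requested records; return (kept_records, missing).

    The accession is read from between the first pair of '|' characters
    in each record, as in the Perl regex.
    """
    missing = set(wanted)
    kept = []
    for record in fasta_text.split(">"):
        match = re.search(r"\|(.+?)\|", record)
        if match and match.group(1) in missing:
            kept.append(">" + record)
            missing.discard(match.group(1))
    return kept, sorted(missing)
```

The text downloaded from build_query_url(...) would be passed to filter_records, the kept records written to the output file, and the missing list printed, which is the same report the Perl script gives.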

modified 4.3 years ago • written 4.3 years ago by Kenosis

As I posted in the original question, I already have a Perl script to accomplish this task. I am trying to understand how to do it in other languages, such as C#.

written 4.3 years ago by Superbest

deepamahm.iisc wrote (9 months ago):

Hello Everyone!

I would like to adapt the Python script in this link so that it does the same task as the Perl script posted here. That is, I have a text file listing UniProt IDs, and I wish to obtain the corresponding protein sequences as a FASTA file using a Python script. After that, I will use each sequence to calculate its molecular weight; I already have a Python script for the molecular weight calculation.

Could someone help me convert the Perl script to Python?

Thanks a lot for your time and kind consideration.

Deepa

written 9 months ago by deepamahm.iisc