Question: Retrieve Amino Acid Sequence From mRNA Accession Number
Asked 5.7 years ago by prajwalnj (USA):

Hello,

I have a set of 500 mRNA accession numbers:

NM_000247
NM_000500
NM_000694
NM_000947
...

and I would like to retrieve the amino acid (aa) sequence for each of them, like this:

>NM_000247 
MGLGPVFLLLAGIFPFAPPGAAAEPHSLRYNLTVLSWDGSVQSG
FLTEVHLDGQPFLRCDRQKCRAKPQGQWAEDVLGNKTWDRETRDLTGNGKDLRMTLAH
IKDQKEGLHSLQEIRVCEIHEDNSTRSSQHFYYDGELFLSQNLETKEWTMPQSSRAQT
LAMNVRNFLKEDAMKTKTHYHAMHADCLQELRRYLKSGVVLRRTVPPMVNVTRSEASE....

I used this script:

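use strict;
use warnings;
use LWP::Simple;   # HTTPS URLs also need LWP::Protocol::https installed
use URI::URL;

if (@ARGV != 3) {
    print "Usage: perl test.pl <database> <id> <your e-mail>\n";
    exit(0);
}

my ($database, $id, $email) = @ARGV;

# NCBI serves E-utilities over HTTPS only
my $address = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

# rettype "gp" requests a GenPept flat file, i.e. the full annotated record
my $parameter = {"db"      => $database,
                 "id"      => $id,
                 "retmode" => "text",
                 "rettype" => "gp",
                 "email"   => $email};

my $url = url($address);
$url->query_form($parameter);

# get() returns undef when the request fails
my $result = get($url);
die "Request failed for $id\n" unless defined $result;
print $result;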

But this works for only a single ID at a time and returns much more information than I need. How can I submit a list of IDs, retrieve only the amino acid sequences, and store the results in a file?

Thank you in advance,

Prajwal

Answer by Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087), 5.7 years ago:

Try using XSLT to extract the protein sequence from the GenBank XML record:

$ echo -e "NM_000500\nNM_000694\nNM_000947" | while read ACN ; do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${ACN}&retmode=xml" | xsltproc --novalid stylesheet.xsl -  ; done
>NP_000491.4
MLLLGLLLLLPLLAGARLLWNWWKLRSLHLPPLAPGFLHLLQPDLPIYLLGLTQKFGPIYRLHLGLQDVVVLNSKRTIEEAMVKKWADFAGRPEPLTYKLVSRNYPDLSLGDYSLLWKAHKKLTRSALLLGIRDSMEPVVEQLTQEFCERMRAQPGTPVAIEEEFSLLTCSIICYLTFGDKIKDDNLMPAYYKCIQEVLKTWSHWSIQIVDVIPFLRFFPNPGLRRLKQAIEKRDHIVEMQLRQHKESLVAGQWRDMMDYMLQGVAQPSMEEGSGQLLEGHVHMAAVDLLIGGTETTANTLSWAVVFLLHHPEIQQRLQEELDHELGPGASSSRVPYKDRARLPLLNATIAEVLRLRPVVPLALPHRTTRPSSISGYDIPEGTVIIPNLQGAHLDETVWERPHEFWPDRFLEPGKNSRALAFGCGARVCLGEPLARLELFVVLTRLLQAFTLLPSGDALPSLQPLPHCSVILKMQPFQVRLQPRGMGAHSPGQSQ
>NP_000685.1
MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL
>NP_000938.2
MEFSGRKWRKLRLAGDQRNASYPHCLQFYLQPPSENISLIEFENLAIDRVKLLKSVENLGVSYVKGTEQYQSKLESELRKLKFSYRENLEDEYEPRRRDHISHFILRLAYCQSEELRRWFIQQEMDLLRFRFSILPKDKIQDFLKDSQLQFEAISDEEKTLREQEIVASSPSLSGLKLGFESIYKIPFADALDLFRGRKVYLEDGFAYVPLKDIVAIILNEFRAKLSKALALTARSLPAVQSDERLQPLLNHLSHSYTGQDYSTQGNVGKISLDQIDLLSTKSFPPCMRQLHKALRENHHLRHGGRMQYGLFLKGIGLTLEQALQFWKQEFIKGKMDPDKFDKGYSYNIRHSFGKEGKRTDYTPFSCLKIILSNPPSQGDYHGCPFRHSDPELLKQKLQSYKISPGGISQILDLVKGTHYQVACQKYFEMIHNVDDCGFSLNHPNQFFCESQRILNGGKDIKKEPIQPETPQPKPSVQKTKDASSALASLNSSLEMDMEGLEDYFSEDS

with stylesheet.xsl:


<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>


<xsl:output method="text" encoding="UTF-8"/>


<xsl:template match="/">
<xsl:for-each select="//GBQualifier[GBQualifier_name='translation'][GBQualifier_value]">
<xsl:text>></xsl:text>
<xsl:choose>
  <xsl:when test="../GBQualifier[GBQualifier_name='protein_id']">
    <xsl:value-of select="../GBQualifier[GBQualifier_name='protein_id']/GBQualifier_value"/>
  </xsl:when>
  <xsl:when test="../GBQualifier[GBQualifier_name='product']">
    <xsl:value-of select="../GBQualifier[GBQualifier_name='product']/GBQualifier_value"/>
  </xsl:when>
  <xsl:otherwise>
    <xsl:value-of select="generate-id(.)"/>
  </xsl:otherwise>
</xsl:choose>
<xsl:text>
</xsl:text>
<xsl:value-of select="GBQualifier_value"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>


</xsl:stylesheet>
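
For a list of several hundred accessions kept in a file, the same idea can be run in batches rather than one request per ID. The sketch below is one possible arrangement and is not part of the answer above. It assumes the accessions sit one per line in a hypothetical `ids.txt`, and it uses efetch's `rettype=fasta_cds_aa` (CDS protein FASTA) in place of the XSLT step, so the header lines will not look like the `>NP_...` headers shown above. The function names, the batch size of 100, and the output file name are illustrative choices.

```shell
#!/bin/sh
# Sketch: batch-download CDS translations for a list of mRNA accessions.
# Assumptions (not from the answer above): function names, batch size,
# file names, and rettype=fasta_cds_aa instead of the XSLT stylesheet.

EFETCH="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Print the efetch URL for a comma-separated list of accessions.
efetch_url() {
    printf '%s?db=nuccore&id=%s&rettype=fasta_cds_aa&retmode=text\n' "$EFETCH" "$1"
}

# Read accessions (one per line) on stdin and print them as comma-separated
# batches of at most $1 IDs, one batch per line. Blank lines are skipped.
make_batches() {
    awk -v n="$1" '
        NF { printf "%s%s", (c % n ? "," : ""), $1; if (++c % n == 0) print "" }
        END { if (c % n) print "" }'
}

# Fetch every batch and append the FASTA records to the output file.
# Usage: fetch_proteins ids.txt proteins.fa
fetch_proteins() {
    : > "$2"   # start with an empty output file
    make_batches 100 < "$1" | while read -r batch; do
        curl -s "$(efetch_url "$batch")" >> "$2"
        sleep 1   # stay well under NCBI's limit of 3 requests per second
    done
}
```

Calling `fetch_proteins ids.txt proteins.fa` would then issue five requests for 500 accessions. Nothing is fetched until that function is called, so the helpers can be checked offline first, for example with `make_batches 2 < ids.txt`.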

Thanks a lot Pierre, this has worked for me!

(comment by prajwalnj, 5.7 years ago)
Answer by shane.neeley (Portland, Oregon), 5.7 years ago:

I know your question is answered, but I thought I would post this script for others interested in doing iterative sequence retrievals. You can try this BioPerl E-utils script. This example is the same as searching for 'crab' in the protein database, but it will save all sequences it finds.

#!/usr/bin/perl -w
# See http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook

BEGIN {push @INC,"path/to/BioPerl";}
use Bio::DB::EUtilities;
# set optional history queue
my $factory = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                       -email      => 'mymail@foo.bar',
                                       -db         => 'protein',
                                       -term       => 'crab',
                                       -usehistory => 'y');

my $count = $factory->get_count;
# get history from queue
my $hist  = $factory->next_History || die 'No history data returned';
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil   => 'efetch',
                         -rettype => 'fasta',
                         -history => $hist);

my $retry = 0;
my ($retmax, $retstart) = (500,0);

open (my $out, '>', 'lots_of_crab_sequences.fa') || die "Can't open file:$!";

RETRIEVE_SEQS:
while ($retstart < $count) {
    $factory->set_parameters(-retmax   => $retmax,
                             -retstart => $retstart);
    eval{
        $factory->get_Response(-cb => sub {my ($data) = @_; print $out $data} );
    };
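    if ($@) {
        die "Server error: $@.  Try again later" if $retry == 5;
        print STDERR "Server error, redo #$retry\n";
        # Increment separately: `$retry++ && redo` would skip the redo on the
        # first failure, because the post-increment returns 0 (false) there.
        $retry++;
        redo RETRIEVE_SEQS;
    }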
    #say "Retrieved $retstart";
    $retstart += $retmax;
}

close $out;
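
The loop above pages through the stored result set in windows of `retmax` records starting at `retstart`. As a rough illustration of the same arithmetic outside BioPerl, the shell sketch below lists the offsets and builds the matching efetch URLs for a history server handle. The function names are hypothetical, and the `WebEnv` and `query_key` values are placeholders that would have to come from a prior esearch call with `usehistory=y`. Nothing here contacts NCBI.

```shell
#!/bin/sh
# Sketch of the retstart/retmax paging used with the Entrez history server.
# Function names are illustrative, not part of any NCBI tool.

# Print the starting offset of each window.
# Usage: page_offsets <count> <retmax>
page_offsets() {
    start=0
    while [ "$start" -lt "$1" ]; do
        echo "$start"
        start=$((start + $2))
    done
}

# Print one efetch URL per window.
# Usage: history_urls <count> <retmax> <WebEnv> <query_key>
history_urls() {
    page_offsets "$1" "$2" | while read -r offset; do
        printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&WebEnv=%s&query_key=%s&retstart=%s&retmax=%s&rettype=fasta&retmode=text\n' \
            "$3" "$4" "$offset" "$2"
    done
}
```

With 1200 matches and a `retmax` of 500, `page_offsets 1200 500` prints 0, 500 and 1000, so three requests cover the whole set; the last window simply returns fewer records.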