Eutilities protein ID to coding sequence in genome
Entering edit mode
6.4 years ago

Thanks for any answers in advance. I have a set of ~1100 protein GIs from mitochondrial and plastid genomes, and am trying to use perl with eutilities to get the corresponding coding DNA sequences. My strategy has been to use elink -> efetch. Elink gives me the associated GI number for the coding DNA sequence, but have run into the problem that the GI links to the entire genome sequence, and not the particular coding region. 

For instance,

gets me the GI 674840664, which is the GI for the genome, not the coding region of the protein of interest.

The coding region is specific by  NC_024755.1:13737..14474, which is the RefSeq id.

Is there a way to link directly to the defined CDS region in the genome? I am likely missing something very obvious here.


eutilities protein_to_coding_sequence • 1.9k views
Entering edit mode
6.4 years ago

Posting an answer to my own question - figured it out after beating my head against the wall for a day or so. 

Solution is to go through esummary after the elink. From the esummary the chromosome positions and coding strand can be parsed out. These features are then used in the efetch call to retrieve the DNA sequence. Here's my Perl script (sorry, not a BioPerl type of guy).

#!/usr/bin/env perl -w
use strict;
use LWP::Simple;

my $protein_id = "256427307";
my $base = '';
my $url = $base . "elink.fcgi?dbfrom=protein&db=gene&id=$protein_id";
my $output = get($url);
#print "$output";

my $linked_gi;
while ($output =~ /<LinkSetDb>(.*?)<\/LinkSetDb>/sg) {
    my $linkset = $1;
    if ($linkset =~ /<Id>(\d+)<\/Id>/sg) {
        $linked_gi = $1;
    #print "linked GI = $linked_gi\n";

my $url3 = $base . "esummary.fcgi?db=gene&id=$linked_gi";
my $docsum = get($url3);
#print "$docsum\n";

my $start;my$stop;my$strand=1;my$chr_ver;
while ($docsum =~ /<GenomicInfo>(.*?)<\/GenomicInfo>/sg) {
    my $data= $1; print "$data";
    if ($data =~ /<ChrAccVer>(.*)<\/ChrAccVer>/sg) {
        $chr_ver =$1;
    if ($data =~ /<ChrStart>(\d+?)<\/ChrStart>/sg) {
        $start =$1;
    if ($data =~ /<ChrStop>(\d+?)<\/ChrStop>/sg) {
        $stop =$1;
    $strand = 2 if $start > $stop;

print "$chr_ver:$start _ $stop\n";

my $url4 = $base."efetch.fcgi?db=nucleotide&id=$chr_ver&rettype=fasta&retmode=text&seq_start=$start&seq_stop=$stop&strand=$strand";
my $seq = get($url4);
print "$seq\n";


Entering edit mode

Many thanks for posting the solution, however does it extract the exact coding sequence for a particular protein?

For example, if we look at this protein. This is the exact coding sequence for that protein. But when I run your script, changing "protein_ID" to 635010873 (the GI number for the protein), it returns this, the full mRNA sequence for this region, but not the coding sequence for the protein of interest?

Basically, my aim is to obtain the longest canonical transcript for each gene. I can obtain a list of proteins and their lengths, and then I was thinking of extracting the longest protein's coding sequence to give me the longest canonical transcript. I've also asked similar questions here and here; if you could shed any light on the topic!


Login before adding your answer.

Traffic: 3689 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6