Question

Eutilities protein ID to coding sequence in genome

8.6 years ago

davidrohanedgell ▴ 10

Thanks for any answers in advance. I have a set of ~1100 protein GIs from mitochondrial and plastid genomes, and am trying to use Perl with E-utilities to get the corresponding coding DNA sequences. My strategy has been to use elink -> efetch. Elink gives me the associated GI number for the coding DNA sequence, but I have run into the problem that this GI links to the entire genome sequence, not the particular coding region.

For instance, http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=nuccore&id=674840670 gets me the GI 674840664, which is the GI for the genome, not the coding region of the protein of interest.

The coding region is specified by NC_024755.1:13737..14474, using the genome's RefSeq ID.

Is there a way to link directly to the defined CDS region in the genome? I am likely missing something very obvious here.

Thanks


Ram · Answer 1 · 2015-09-10

Posting an answer to my own question - figured it out after beating my head against the wall for a day or so.

The solution is to elink from protein to the gene database (rather than nuccore) and then run esummary on the linked gene ID. The genomic accession, the start and stop positions, and the coding strand can be parsed out of the esummary. These are then passed to efetch to retrieve just that stretch of DNA. Here's my Perl script (sorry, not a BioPerl type of guy).

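#!/usr/bin/env perl
use strict;
use warnings;      # replaces -w, which env-style shebangs often cannot pass through
use LWP::Simple;   # https URLs also need LWP::Protocol::https installed

my $protein_id = "256427307";
# NCBI serves E-utilities over https only these days.
my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Step 1: elink from the protein record to its Gene record.
my $url = $base . "elink.fcgi?dbfrom=protein&db=gene&id=$protein_id";
my $output = get($url);
die "elink request failed\n" unless defined $output;
#print "$output";

my $linked_gi;   # despite the name, this holds a Gene ID
while ($output =~ /<LinkSetDb>(.*?)<\/LinkSetDb>/sg) {
    my $linkset = $1;
    if ($linkset =~ /<Id>(\d+)<\/Id>/) {
        $linked_gi = $1;
    }
    #print "linked GI = $linked_gi\n";
}
die "no linked gene found for protein $protein_id\n" unless defined $linked_gi;

# Step 2: esummary on the gene gives the genomic accession and coordinates.
my $url3 = $base . "esummary.fcgi?db=gene&id=$linked_gi";
my $docsum = get($url3);
die "esummary request failed\n" unless defined $docsum;
#print "$docsum\n";

my ($start, $stop, $chr_ver);
my $strand = 1;
while ($docsum =~ /<GenomicInfo>(.*?)<\/GenomicInfo>/sg) {
    my $data = $1;
    #print "$data";
    if ($data =~ /<ChrAccVer>(.*?)<\/ChrAccVer>/s) {
        $chr_ver = $1;
    }
    if ($data =~ /<ChrStart>(\d+)<\/ChrStart>/) {
        $start = $1;
    }
    if ($data =~ /<ChrStop>(\d+)<\/ChrStop>/) {
        $stop = $1;
    }
}
die "no genomic coordinates found for gene $linked_gi\n"
    unless defined $chr_ver && defined $start && defined $stop;

# ChrStart > ChrStop marks a gene on the minus strand (strand=2 in efetch).
if ($start > $stop) {
    $strand = 2;
    ($start, $stop) = ($stop, $start);   # efetch wants seq_start <= seq_stop
}
# Gene docsum coordinates are 0-based as far as I can tell, while efetch
# counts from 1, so shift both ends by one.
$start++;
$stop++;

print "$chr_ver:$start..$stop strand $strand\n";

# Step 3: efetch only the coding region from the genome record.
my $url4 = $base."efetch.fcgi?db=nucleotide&id=$chr_ver&rettype=fasta&retmode=text&seq_start=$start&seq_stop=$stop&strand=$strand";
my $seq = get($url4);
die "efetch request failed\n" unless defined $seq;
print "$seq\n";

exit;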
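
One detail worth spelling out is the coordinate arithmetic between esummary and efetch. To the best of my knowledge, the Gene docsum reports ChrStart and ChrStop as 0-based positions and lists them in reverse order for minus-strand genes, while efetch expects 1-based seq_start <= seq_stop plus strand=1 or strand=2. Here is a minimal sketch of that conversion. The helper name `efetch_params` is made up for illustration, and the input numbers are assumed values chosen so that they map onto the 13737..14474 region from the question, not values read from a real docsum.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not an NCBI or BioPerl function): convert the
# ChrStart/ChrStop pair from a Gene docsum into efetch parameters.
# Assumes the docsum values are 0-based and that ChrStart > ChrStop
# means the gene sits on the minus strand.
sub efetch_params {
    my ($chr_start, $chr_stop) = @_;
    my $strand = $chr_start > $chr_stop ? 2 : 1;
    # Order the ends numerically, then shift both to 1-based.
    my ($lo, $hi) = sort { $a <=> $b } ($chr_start, $chr_stop);
    return (seq_start => $lo + 1, seq_stop => $hi + 1, strand => $strand);
}

# Assumed example values, plus strand and minus strand.
my %plus  = efetch_params(13736, 14473);
my %minus = efetch_params(14473, 13736);

printf "plus:  %d..%d strand %d\n", @plus{qw(seq_start seq_stop strand)};
printf "minus: %d..%d strand %d\n", @minus{qw(seq_start seq_stop strand)};
```

Both calls give the range 13737..14474, differing only in strand. It is worth checking the first base of a few retrieved sequences (a CDS should normally begin with a start codon) before running this over all ~1100 IDs, in case the 0-based assumption does not hold for your records.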