Question: Retrieving CDS sequence for a protein genbank ID
gravatar for alslonik
3.5 years ago by
alslonik150 wrote:

Hi Community,

I am new here, and new to Python, so hopefully my question is eligible and clear. I have blastP results with protein accession IDs. I am using biopython to fetch the protein sequence, but I can't find a way to fetch the corresponding CDS from given GenPept file.

When I use the python script below , it works well for proteins like AAY99645.1, which have the CDS portion look like: CDS 1..1196 /gene="AHI1" /coded_by="DQ090887.1:339..3929"

The problem is that for a very large part of proteins that I need there is no CDS link from the protein entry (i.e WP_020126478.1). Therefore nothing is retrieved. The only way to get to CDS in these cases is to go to the "related info" link, parse it and find a cds portion there, as part of a whole shotgun genome sequence.

Is there something that I miss? Or is this the only way?

This is the code I use :

import re
import urllib
import fileinput

prots = [];
for line in fileinput.input():
    line = line.rstrip('\r\n')

for i in range(len(prots)):
    f = urllib.urlopen("" + prots[i])
    for line in'\n'):
        m ='coded_by=\"(.+)\"$', line);
        if m is not None:
            target =
            accn = ""
            start = 0
            end = 0
            rc = False
            seq = ""
            m ='^(.+):(\d+)\.\.(\d+)$', target)
            if m is not None:
                accn =
                start =
                end =
                m ='^complement\((.+):(\d+)\.\.(\d+)\)$', target)
                if m is not None:
                    accn =
                    start =
                    end =
                    rc = True
            f = urllib.urlopen(""   + accn + "&from=" + str(start) + "&to=" + str(end))
            for row in'\n'):
                if len(row) == 0:
                if row[0] == ">":
                seq += row;
            if rc == True:
                revseq = seq[::-1]
                rcseq = ""
                d = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
                for char in revseq:
                    rcseq += d[char]
                seq = rcseq
            print ">" + prots[i] + "|" + target + "\n" + seq
genbank biopython • 1.2k views
ADD COMMENTlink written 3.5 years ago by alslonik150

WP* records

represents a single, non-redundant, protein sequence which may be annotated on many different RefSeq genomes from the same, or different, species

So it is not surprising that you don't get a singe CDS sequence.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by genomax92k

Not sure I understand... My specific example is annotated to a single genome shotgun sequence, still the record on protein database is not linked directly to its cds, which makes the job of retrieving it much more complicated than it is for other records..

ADD REPLYlink written 3.5 years ago by alslonik150
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1162 users visited in the last hour