Question

Retrieving CDS sequence for a protein genbank ID

0

Entering edit mode

6.9 years ago

alslonik ▴ 310

Hi Community,

I am new here, and new to Python, so hopefully my question is eligible and clear. I have blastP results with protein accession IDs. I am using biopython to fetch the protein sequence, but I can't find a way to fetch the corresponding CDS from given GenPept file.

When I use the python script below , it works well for proteins like AAY99645.1, which have the CDS portion look like: CDS 1..1196 /gene="AHI1" /coded_by="DQ090887.1:339..3929"

The problem is that for a very large part of proteins that I need there is no CDS link from the protein entry (i.e WP_020126478.1). Therefore nothing is retrieved. The only way to get to CDS in these cases is to go to the "related info" link, parse it and find a cds portion there, as part of a whole shotgun genome sequence.

Is there something that I miss? Or is this the only way?

This is the code I use :

import re
import urllib
import fileinput

prots = [];
for line in fileinput.input():
    line = line.rstrip('\r\n')
    prots.append(line)

for i in range(len(prots)):
    f = urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=gb&id=" + prots[i])
    for line in f.read().split('\n'):
        m = re.search(r'coded_by=\"(.+)\"$', line);
        if m is not None:
            target = m.group(1)
            accn = ""
            start = 0
            end = 0
            rc = False
            seq = ""
            m = re.search(r'^(.+):(\d+)\.\.(\d+)$', target)
            if m is not None:
                accn = m.group(1)
                start = m.group(2)
                end = m.group(3)
            else:
                m = re.search(r'^complement\((.+):(\d+)\.\.(\d+)\)$', target)
                if m is not None:
                    accn = m.group(1)
                    start = m.group(2)
                    end = m.group(3)
                    rc = True
            f = urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&id="   + accn + "&from=" + str(start) + "&to=" + str(end))
            for row in f.read().split('\n'):
                if len(row) == 0:
                    continue;
                if row[0] == ">":
                    continue;
                seq += row;
            if rc == True:
                revseq = seq[::-1]
                rcseq = ""
                d = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
                for char in revseq:
                    rcseq += d[char]
                seq = rcseq
            print ">" + prots[i] + "|" + target + "\n" + seq

biopython genbank • 2.2k views

ADD COMMENT • link 6.9 years ago by alslonik ▴ 310

0

Entering edit mode

WP* records

represents a single, non-redundant, protein sequence which may be annotated on many different RefSeq genomes from the same, or different, species

So it is not surprising that you don't get a singe CDS sequence.

ADD REPLY • link 6.9 years ago by GenoMax 141k

0

Entering edit mode

Not sure I understand... My specific example is annotated to a single genome shotgun sequence, still the record on protein database is not linked directly to its cds, which makes the job of retrieving it much more complicated than it is for other records..

ADD REPLY • link 6.9 years ago by alslonik ▴ 310