Retrieving CDS sequence for a protein genbank ID
0
0
Entering edit mode
6.9 years ago
alslonik ▴ 310

Hi Community,

I am new here, and new to Python, so hopefully my question is eligible and clear. I have blastP results with protein accession IDs. I am using biopython to fetch the protein sequence, but I can't find a way to fetch the corresponding CDS from given GenPept file.

When I use the python script below , it works well for proteins like AAY99645.1, which have the CDS portion look like: CDS 1..1196 /gene="AHI1" /coded_by="DQ090887.1:339..3929"

The problem is that for a very large part of proteins that I need there is no CDS link from the protein entry (i.e WP_020126478.1). Therefore nothing is retrieved. The only way to get to CDS in these cases is to go to the "related info" link, parse it and find a cds portion there, as part of a whole shotgun genome sequence.

Is there something that I miss? Or is this the only way?

This is the code I use :

import re
import urllib
import fileinput

prots = [];
for line in fileinput.input():
    line = line.rstrip('\r\n')
    prots.append(line)

for i in range(len(prots)):
    f = urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=gb&id=" + prots[i])
    for line in f.read().split('\n'):
        m = re.search(r'coded_by=\"(.+)\"$', line);
        if m is not None:
            target = m.group(1)
            accn = ""
            start = 0
            end = 0
            rc = False
            seq = ""
            m = re.search(r'^(.+):(\d+)\.\.(\d+)$', target)
            if m is not None:
                accn = m.group(1)
                start = m.group(2)
                end = m.group(3)
            else:
                m = re.search(r'^complement\((.+):(\d+)\.\.(\d+)\)$', target)
                if m is not None:
                    accn = m.group(1)
                    start = m.group(2)
                    end = m.group(3)
                    rc = True
            f = urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&id="   + accn + "&from=" + str(start) + "&to=" + str(end))
            for row in f.read().split('\n'):
                if len(row) == 0:
                    continue;
                if row[0] == ">":
                    continue;
                seq += row;
            if rc == True:
                revseq = seq[::-1]
                rcseq = ""
                d = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
                for char in revseq:
                    rcseq += d[char]
                seq = rcseq
            print ">" + prots[i] + "|" + target + "\n" + seq
biopython genbank • 2.2k views
ADD COMMENT
0
Entering edit mode

WP* records

represents a single, non-redundant, protein sequence which may be annotated on many different RefSeq genomes from the same, or different, species

So it is not surprising that you don't get a singe CDS sequence.

ADD REPLY
0
Entering edit mode

Not sure I understand... My specific example is annotated to a single genome shotgun sequence, still the record on protein database is not linked directly to its cds, which makes the job of retrieving it much more complicated than it is for other records..

ADD REPLY

Login before adding your answer.

Traffic: 2573 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6