Question: Problem with downloading CDS sequences using Biopython
1
gravatar for cl10101
4.2 years ago by
cl1010180
cl1010180 wrote:

I'm trying to download CDS sequences for a given genome using Biopython. My script looks like this:

from Bio import Entrez 
from Bio import SeqIO

Entrez.email     = "c...@gmail.com"  
genomeAccessions = ['NC_021353.1', 'NC_020913.1']
handle = Entrez.efetch(db="nucleotide", id=genomeAccessions, rettype="gb")
records = SeqIO.parse(handle, "gb")

for i,record in enumerate(records):

    print(len(record.features))

    for feature in record.features:
        if feature.type == "CDS":
            print feature.location
            print feature.qualifiers["protein_id"]
            print feature.location.extract(record).seq

But using this code I get only one feature (for example for genome NC_021353) even though there are many features http://www.ncbi.nlm.nih.gov/nuccore/NC_021353.

I would be grateful for any suggestion what I'm doing wrong.

entrez biopython ncbi • 1.9k views
ADD COMMENTlink modified 4.2 years ago by Markus290 • written 4.2 years ago by cl1010180
1

I think something similar has been already discussed here:

But this solution is not in Biopython.

Download cds region coordinates

This one is.

How To Extract Just 'Cds' From Genbank File Into Another Genbank File?

There are some other suggestions here:

BioPython error parsing standard GenBank file

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by natasha.sernova3.7k
2
gravatar for Markus
4.2 years ago by
Markus290
Markus290 wrote:

Your rettype must be "gbwithparts", this should work:

handle = Entrez.efetch(db="nucleotide", id=genomeAccessions, rettype="gbwithparts")

Running your program:

>>> 
======================== RESTART: D:/Desktop/test.py ========================
3823
[580:1822](+)
['WP_020447944.1']
ATGGACAACATCAGCGGCAACATCTTTG...
...
[1818:2619](-)
['WP_020447945.1']
GTGAATAAAGAGAACAAATCTACTTTAA...
...

Your code will break later, possibly there are entries without a protein_id:

Traceback (most recent call last):
  File "D:/Desktop/test.py", line 20, in <module>
    print feature.qualifiers["protein_id"]
KeyError: 'protein_id'
ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Markus290
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1950 users visited in the last hour