Question

Problem with downloading CDS sequences using Biopython

1

Entering edit mode

8.0 years ago

cl10101 ▴ 80

I'm trying to download CDS sequences for a given genome using Biopython. My script looks like this:

from Bio import Entrez 
from Bio import SeqIO

Entrez.email     = "c...@gmail.com"  
genomeAccessions = ['NC_021353.1', 'NC_020913.1']
handle = Entrez.efetch(db="nucleotide", id=genomeAccessions, rettype="gb")
records = SeqIO.parse(handle, "gb")

for i,record in enumerate(records):

    print(len(record.features))

    for feature in record.features:
        if feature.type == "CDS":
            print feature.location
            print feature.qualifiers["protein_id"]
            print feature.location.extract(record).seq

But using this code I get only one feature (for example for genome NC_021353) even though there are many features http://www.ncbi.nlm.nih.gov/nuccore/NC_021353.

I would be grateful for any suggestion what I'm doing wrong.

biopython ncbi entrez • 3.2k views

ADD COMMENT • link updated 8.0 years ago by Markus ▴ 320 • written 8.0 years ago by cl10101 ▴ 80

1

Entering edit mode

I think something similar has been already discussed here:

But this solution is not in Biopython.

Download cds region coordinates

This one is.

How To Extract Just 'Cds' From Genbank File Into Another Genbank File?

There are some other suggestions here:

BioPython error parsing standard GenBank file

ADD REPLY • link 8.0 years ago by natasha.sernova ★ 4.0k

score 2 · Accepted Answer · 2016-04-13

Your rettype must be "gbwithparts", this should work:

handle = Entrez.efetch(db="nucleotide", id=genomeAccessions, rettype="gbwithparts")

Running your program:

>>> 
======================== RESTART: D:/Desktop/test.py ========================
3823
[580:1822](+)
['WP_020447944.1']
ATGGACAACATCAGCGGCAACATCTTTG...
...
[1818:2619](-)
['WP_020447945.1']
GTGAATAAAGAGAACAAATCTACTTTAA...
...

Your code will break later, possibly there are entries without a protein_id:

Traceback (most recent call last):
  File "D:/Desktop/test.py", line 20, in <module>
    print feature.qualifiers["protein_id"]
KeyError: 'protein_id'