Question

Extract all CDS from genbank file

0

Entering edit mode

18 months ago

gimenezenrico ▴ 10

Hello, everyone!

I'd like to run OrthoFinder with nucleotide input instead of protein input, but the genomes I'm working with are from an old version, then Genbank doesn't allow me to download de CDS fasta file directly from the site. So I downloaded the gbk file and I'm trying, without success, to extract all CDS from it. If someone could help me I'd appreciate it.

Those are the gb files: https://www.ncbi.nlm.nih.gov/nuccore/NC_018019.2 https://www.ncbi.nlm.nih.gov/nuccore/NC_018019.1

from Bio import SeqIO

record = SeqIO.parse(
    "./NC_018019.2-complete.gb", "genbank")

record.features = [f for f in record.features if f.type == "CDS"]

SeqIO.write(cds, 'NC_01809.2_V2_nucleotide.fasta', 'fasta')

and i've the following error:

Traceback (most recent call last): File "/home/enrico/Documentos/materialcp162/dados/CDSs/nucleotideos/./getcds.py", line 7, in <module>
    record.features = [f for f in record.features if f.type == "CDS"]
AttributeError: 'GenBankIterator' object has no attribute 'features'

Thank you

fasta biopython genbank cds python • 2.6k views

ADD COMMENT • link updated 16 months ago by Alban Nabla ▴ 30 • written 18 months ago by gimenezenrico ▴ 10

1

Entering edit mode

16 months ago

Alban Nabla ▴ 30

This was already neatly solved above, but I thought it could be a good exercise to show how to do what requested in BioPython:

from Bio import SeqIO

genome = next(SeqIO.parse("NC_018019.2.gb", "genbank"))

cds = []
for feature in genome.features:
    if feature.type == 'CDS':
    cds.append(feature)

seqs = []
for feature in cds:
    seq = genome[int(feature.location.start):int(feature.location.end)]
    if feature.location.strand == -1: 
        seq.seq = seq.seq.reverse_complement()
    seq.description = feature.qualifiers['product'][0]
    seqs.append(seq)

SeqIO.write(seqs, 'CDS.fasta', 'fasta')

Using Entrez Efetch is to be preferred in any case, as it would provide reliable source data without having to do any manipulation.

ADD COMMENT • link 16 months ago by Alban Nabla ▴ 30

score 2 · Accepted Answer · 2022-10-24

Not python but you can use Entrezdirect to get the CDS sequences from the accessions you want (example truncated to save space).

$ esearch -db nuccore -query NC_018019.1 | efetch -format fasta_cds_na > CDS.fa

$ more CDS.fa
>lcl|NC_018019.1_cds_WP_014799813.1_1 [locus_tag=CP162_RS00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_014799813.1] [location=1..1812] [gbkey=CDS]
GTGTCGGAGGCTCCATCGACATGGAACGAGCGGTGGCAAGAAGTTACTAATGAGCTGCTGTCACAGTCTC
AGGACCCAGAAAGTGGTATTTCCATTACGCGACAGCAAAGCGCCTACCTGCGTCTGGTTAAACCAGTTGC
TTTTGTAGAGGGTATTGCCGTTTTAAGCGTTCCTCACGCCCGAGCGAAAAAAGAGATTGAAACTACGCTG
GGACCTGTTATCACAGAGGTATTGTCTCGTAGACTAGGTCGACAATACAGTCTTGCAGTGAGCGTTCATG
CTCCAGAGGAAAATCCAGAAGTATCCTCGGCCACTCCAGATGCTGTGTCTGATTACCAGGAACAATCTGC