Extract all CDS from genbank file
2
0
Entering edit mode
3 months ago

Hello, everyone!

I'd like to run OrthoFinder with nucleotide input instead of protein input, but the genomes I'm working with are from an old version, then Genbank doesn't allow me to download de CDS fasta file directly from the site. So I downloaded the gbk file and I'm trying, without success, to extract all CDS from it. If someone could help me I'd appreciate it.

Those are the gb files: https://www.ncbi.nlm.nih.gov/nuccore/NC_018019.2 https://www.ncbi.nlm.nih.gov/nuccore/NC_018019.1

from Bio import SeqIO

record = SeqIO.parse(
    "./NC_018019.2-complete.gb", "genbank")

record.features = [f for f in record.features if f.type == "CDS"]

SeqIO.write(cds, 'NC_01809.2_V2_nucleotide.fasta', 'fasta')

and i've the following error:

Traceback (most recent call last): File "/home/enrico/Documentos/materialcp162/dados/CDSs/nucleotideos/./getcds.py", line 7, in <module>
    record.features = [f for f in record.features if f.type == "CDS"]
AttributeError: 'GenBankIterator' object has no attribute 'features'

Thank you

fasta biopython genbank cds python • 604 views
ADD COMMENT
2
Entering edit mode
3 months ago
GenoMax 125k

Not python but you can use Entrezdirect to get the CDS sequences from the accessions you want (example truncated to save space).

$ esearch -db nuccore -query NC_018019.1 | efetch -format fasta_cds_na > CDS.fa

$ more CDS.fa
>lcl|NC_018019.1_cds_WP_014799813.1_1 [locus_tag=CP162_RS00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_014799813.1] [location=1..1812] [gbkey=CDS]
GTGTCGGAGGCTCCATCGACATGGAACGAGCGGTGGCAAGAAGTTACTAATGAGCTGCTGTCACAGTCTC
AGGACCCAGAAAGTGGTATTTCCATTACGCGACAGCAAAGCGCCTACCTGCGTCTGGTTAAACCAGTTGC
TTTTGTAGAGGGTATTGCCGTTTTAAGCGTTCCTCACGCCCGAGCGAAAAAAGAGATTGAAACTACGCTG
GGACCTGTTATCACAGAGGTATTGTCTCGTAGACTAGGTCGACAATACAGTCTTGCAGTGAGCGTTCATG
CTCCAGAGGAAAATCCAGAAGTATCCTCGGCCACTCCAGATGCTGTGTCTGATTACCAGGAACAATCTGC
ADD COMMENT
0
Entering edit mode

Many thanks! It worked!

ADD REPLY
0
Entering edit mode
9 weeks ago
Alban Nabla ▴ 20

This was already neatly solved above, but I thought it could be a good exercise to show how to do what requested in BioPython:

from Bio import SeqIO

genome = next(SeqIO.parse("NC_018019.2.gb", "genbank"))

cds = []
for feature in genome.features:
    if feature.type == 'CDS':
    cds.append(feature)

seqs = []
for feature in cds:
    seq = genome[int(feature.location.start):int(feature.location.end)]
    if feature.location.strand == -1: 
        seq.seq = seq.seq.reverse_complement()
    seq.description = feature.qualifiers['product'][0]
    seqs.append(seq)

SeqIO.write(seqs, 'CDS.fasta', 'fasta')

Using Entrez Efetch is to be preferred in any case, as it would provide reliable source data without having to do any manipulation.

ADD COMMENT

Login before adding your answer.

Traffic: 3174 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6