Question

How can I obtain the DNA sequences of each CDS for several genbank files?

0

Entering edit mode

4 months ago

Raphaëlle • 0

Hello,

I want to obtain DNA sequences of all the CDS from multiple genbank files in one fasta file. I tried several solutions with Biopython but nothing is working for me.

I tried for example the following code:

from Bio import SeqIO

genome = next(SeqIO.parse("NC_018019.2.gb", "genbank"))

cds = []
for feature in genome.features:
    if feature.type == 'CDS':
    cds.append(feature)

seqs = []
for feature in cds:
    seq = genome[int(feature.location.start):int(feature.location.end)]
    if feature.location.strand == -1: 
        seq.seq = seq.seq.reverse_complement()
    seq.description = feature.qualifiers['product'][0]
    seqs.append(seq)

SeqIO.write(seqs, 'CDS.fasta', 'fasta')

But on line 4 I got the error IndentationError: expected an indented block after 'if' statement on line 2.

I am new with python, so I don't know how to make it works.

Also this solution was quite old and I am working on Python 3.12.0 and Biopython 1.81 (can get also 1.76).

Maybe this is an issue of versions.

Here is an example of genbank file:

enter image description here

Ideally, I would like to obtain all the DNA sequences from all the genbank files for all the CDS and another for all the "integrase".

Thank you very much for your help.

genbank DNA sequence • 1.6k views

ADD COMMENT • link 4 months ago by Raphaëlle • 0

score 0 · Answer 1 · 2023-12-08

0

Entering edit mode

4 months ago

Ram 43k

This is a simple python issue. You're missing indentation after your first if:

cds = []
for feature in genome.features:
    if feature.type == 'CDS':
    cds.append(feature)

should be

cds = []
for feature in genome.features:
    if feature.type == 'CDS':
        cds.append(feature)

ADD COMMENT • link 4 months ago by Ram 43k

0

Entering edit mode

Hello, Thank you very much for your quick reply. That was indeed the issue!

Another question: is there a simple way of automating this script so that it applies to all gbk files?

ADD REPLY • link 4 months ago by Raphaëlle • 0

0

Entering edit mode

Look into passing command line parameters to a python script. You can pass the input file name as a parameter to the script and execute the script from the command line. You'd also need to set the output file name dynamically (using the input file name to derive it, for example).

An alternative is to wrap the code in a function and call the function using a simple loop or list comprehension.

ADD REPLY • link 4 months ago by Ram 43k

0

Entering edit mode

Thanks again for your reply. I'm not used to python so I don't really know how to do that, but I will search how to put the input file name as a paremeter.

ADD REPLY • link 4 months ago by Raphaëlle • 0

0

Entering edit mode

Here's a hint and some pseudocode. There are more robust ways of doing it but this is a quick and dirty solution.

import sys

files = sys.argv[1:]
for file in files:
    genome = SeqIO.read(file, "genbank"))

# do rest of script

ADD REPLY • link 4 months ago by Joe 21k

0

Entering edit mode

Thank you, I see it just now

ADD REPLY • link 4 months ago by Raphaëlle • 0

0

Entering edit mode

I tried adding this part to my code with also

files = sys.argv[1:]
for file in files:
    genome = next(SeqIO.parse(file, "genbank"))

But for the next step of my script I've got the following error "NameError: name 'genome' is not defined

ADD REPLY • link 4 months ago by Raphaëlle • 0

0

Entering edit mode

I suggest you don't use parse. You are making life complicated for yourself with the use of next etc.

Have a look at the biopython manual to understand the difference.

ADD REPLY • link 4 months ago by Joe 21k

0

Entering edit mode

Okay. I have tried it also, but this time I changed the way to call the script. In the terminal I write:

python /work/gbk_script_test3.py /work/data_directory

My script is the following:

from Bio import SeqIO
import sys


files = sys.argv[1:]

for file in files:
    genome = SeqIO.read(file, "genbank")

cds = []
for feature in genome.features:
    if feature.type == 'CDS':
        cds.append(feature)

seqs = []
for feature in cds:
    seq = genome[int(feature.location.start):int(feature.location.end)]
    if feature.location.strand == -1: 
         seq.seq = seq.seq.reverse_complement()
         seq.description = feature.qualifiers['protein_id'][0]
         seqs.append(seq)

SeqIO.write(seqs, 'test_all_gbk.fasta', 'fasta')

but I get the following error:

Traceback (most recent call last):
  File "/work/rpeguilhan/gbk_script_test3.py", line 8, in <module>
    genome = SeqIO.read(file, "genbank")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/softs/contrib/apps/anaconda/3/envs/biopython-1.81/lib/python3.12/site-packages/Bio/SeqIO/__init__.py", line 652, in read
    iterator = parse(handle, format, alphabet)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/softs/contrib/apps/anaconda/3/envs/biopython-1.81/lib/python3.12/site-packages/Bio/SeqIO/__init__.py", line 605, in parse
    return iterator_generator(handle)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/softs/contrib/apps/anaconda/3/envs/biopython-1.81/lib/python3.12/site-packages/Bio/SeqIO/InsdcIO.py", line 97, in __init__
    super().__init__(source, mode="t", fmt="GenBank")
  File "/softs/contrib/apps/anaconda/3/envs/biopython-1.81/lib/python3.12/site-packages/Bio/SeqIO/Interfaces.py", line 45, in __init__
    self.stream = open(source, "r" + mode)
                  ^^^^^^^^^^^^^^^^^^^^^^^^
IsADirectoryError: [Errno 21] Is a directory: '/work/data_directory'

Sorry, I'm not really used to python. This is the first time I've used it because I haven't found any other solution for recovering nucleic acid CDS sequences.

ADD REPLY • link 4 months ago by Raphaëlle • 0

0

Entering edit mode

You are passing it a directory, when it is expecting a list of filenames. Hence the error is telling you you've given it a directory.

Run the code as python scriptname.py /path/to/files/*.gbk instead

ADD REPLY • link 4 months ago by Joe 21k

0

Entering edit mode

Thanks a lot. I thought my script already took the gbk files and I just had to specify a directory. Anyway, it works now. Only one last problem: my last line to get a file with all the sequences is probably wrong. I only got the CDS sequences from the last .gbk file in my directory.

I have found scripts to create different files depending on each .gbk file, but in my case I want a single final fasta file with all the sequences from all the .gbk files. Is this possible?

ADD REPLY • link 4 months ago by Raphaëlle • 0

0

Entering edit mode

You are defining the seqs list during the loop, so it is overwritten each time the loop runs (once per file). You need to append the sequences to the list inside the loop as you currently do, but you need to declare the list just once outside the loop.

Then the list will be appended with the CDSs from each file.

ADD REPLY • link 4 months ago by Joe 21k

0

Entering edit mode

Thanks Sorry, but again I'm not used to python, so I'm not sure I understand your suggestion. Do you mean that I should put the command line "SeqIO.write(seqs, 'test_all_gbk.fasta', 'fasta')" inside the loop? I've tried this option but to no success. I must have forgotten some of the steps. And how do I "declare the list" outside the loop?

ADD REPLY • link 4 months ago by Raphaëlle • 0

0

Entering edit mode

An other issue I found today: my script is not working for all cds sequences. Actually it takes only the "complement" ones and not the others. I guess there is something to do with the line "seq.seq = seq.seq.reverse_complement()", however, I don't know how to allow both (complement and 'normal' sequences).

ADD REPLY • link 4 months ago by Raphaëlle • 0

0

Entering edit mode

Sorry, It was actually quite easy, I added an "else" line.

ADD REPLY • link 4 months ago by Raphaëlle • 0

0

Entering edit mode

I've tried to adapt a solution to my script, which is as follows:

from Bio import SeqIO
import os
from glob import glob

file_directory= '/work/my_directory'

file_names=glob('%s/*.gbk'%file_directory)

for file_ in file_names:
    genome = next(SeqIO.parse(file_, "genbank"))
    cds = []
    for feature in genome.features:
        if feature.type == 'CDS':
            cds.append(feature)
    seqs = []
    for feature in cds:
        seq = genome[int(feature.location.start):int(feature.location.end)]
        if feature.location.strand == -1: 
            seq.seq = seq.seq.reverse_complement()
            seq.description = feature.qualifiers['protein_id'][0]
            seqs.append(seq)

SeqIO.write(seqs, 'test_all_gbk.fasta', 'fasta')

but I can't get any sequences

ADD REPLY • link 4 months ago by Raphaëlle • 0

0

Entering edit mode

We can't help you debug the script if you don't tell us what the actual problem is. You need to share the exact errors.

ADD REPLY • link 4 months ago by Joe 21k

0

Entering edit mode

After the loop (for file_...) I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'files' is not defined. Did you mean: 'filter'?

ADD REPLY • link 4 months ago by Raphaëlle • 0

0

Entering edit mode

That doesn't add up - are you sure the code you've pasted here matches the code you're running?

ADD REPLY • link 4 months ago by Ram 43k

0

Entering edit mode

You don't use the variable files anywhere in that script, so naturally, it is not defined.

I think you need to proof-read your script very closely and understand what each element is trying to do. You've also posted 2 different versions for which you're now getting 2 different errors, so be careful you aren't getting mixed up.

ADD REPLY • link 4 months ago by Joe 21k