Question: Mal-formed sequence line error in Bio.SeqIO
0
gravatar for stueckmann.daniel
8 weeks ago by
stueckmann.daniel0 wrote:

Hello,

I'm trying to use BioPython's SeqIO parser to load a ~3GB Genbank file with ~20,000 whole bacterial genomes and write .fasta files containing one sequence each. My code is working fine for the first ~2,000 genomes, but then it hits an invalid sequence line and terminates. The initial error message showed up as a BiopythonParserWarning, but after using the following code

from Bio import BiopythonParserWarning
warnings.simplefilter('ignore', BiopythonParserWarning)

it now results in a generic ValueError: Sequence line mal-formed 'title>NCBI/ffsrv11 - WWW Error 500 Diagnostic</title>' I can't find any info about this in the BioPython documentation, and I was wondering if there was any way to handle this error without storing the entire file in a list (takes a lot of RAM and my data sets will soon make this completely nonviable).

Here is the entirety of the code I am using:

from Bio import SeqIO
import warnings
from Bio import BiopythonParserWarning
warnings.simplefilter('ignore', BiopythonParserWarning)

#input file (genbank), change path
records = SeqIO.parse(r'C:\Users\aaa\aaa\file.gb', 'genbank')
#output folder(change path)
output_folder = r'C:\Users\\aaa\aaa\folder\\'

organism_dict = {}
for record in records:
    try:
        #create a file name in case we need to create a fasta file for this sequence, uses the SeqFeature qualifiers "organism" and "strain"
        doc_name = str(record.features[0].qualifiers["organism"][0]) + ' strain ' + str(record.features[0].qualifiers["strain"][0]) + '.fasta'
        #remove characters that are invalid for file names and/or newick strings, replace with spaces
        doc_name = doc_name.replace('(', ' ').replace(')', ' ').replace('/', ' ').replace('\\', ' ').replace(',', ' ')
        #use this doc name to see if this strain exists in our dictionary
        if doc_name in organism_dict:
            #check to see if the current sequence for this strain is shorter than our new sequence 
            if len(record.seq) > organism_dict[doc_name]:
                organism_dict[doc_name] = len(record.seq)
                with open(output_folder + doc_name, 'w') as handle:
                    SeqIO.write(record, handle, 'fasta')
        else:
            organism_dict[doc_name] = len(record.seq)
            with open(output_folder + doc_name, 'w') as handle:
                    SeqIO.write(record, handle, 'fasta')
    except:
        printrecord.id, record.features[0].qualifiers["organism"])

Thank you, Daniel

genbank biopython python • 131 views
ADD COMMENTlink written 8 weeks ago by stueckmann.daniel0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 654 users visited in the last hour