Mal-formed sequence line error in Bio.SeqIO
0
0
Entering edit mode
4.8 years ago

Hello,

I'm trying to use BioPython's SeqIO parser to load a ~3GB Genbank file with ~20,000 whole bacterial genomes and write .fasta files containing one sequence each. My code is working fine for the first ~2,000 genomes, but then it hits an invalid sequence line and terminates. The initial error message showed up as a BiopythonParserWarning, but after using the following code

from Bio import BiopythonParserWarning
warnings.simplefilter('ignore', BiopythonParserWarning)

it now results in a generic ValueError: Sequence line mal-formed 'title>NCBI/ffsrv11 - WWW Error 500 Diagnostic</title>' I can't find any info about this in the BioPython documentation, and I was wondering if there was any way to handle this error without storing the entire file in a list (takes a lot of RAM and my data sets will soon make this completely nonviable).

Here is the entirety of the code I am using:

from Bio import SeqIO
import warnings
from Bio import BiopythonParserWarning
warnings.simplefilter('ignore', BiopythonParserWarning)

#input file (genbank), change path
records = SeqIO.parse(r'C:\Users\aaa\aaa\file.gb', 'genbank')
#output folder(change path)
output_folder = r'C:\Users\\aaa\aaa\folder\\'

organism_dict = {}
for record in records:
    try:
        #create a file name in case we need to create a fasta file for this sequence, uses the SeqFeature qualifiers "organism" and "strain"
        doc_name = str(record.features[0].qualifiers["organism"][0]) + ' strain ' + str(record.features[0].qualifiers["strain"][0]) + '.fasta'
        #remove characters that are invalid for file names and/or newick strings, replace with spaces
        doc_name = doc_name.replace('(', ' ').replace(')', ' ').replace('/', ' ').replace('\\', ' ').replace(',', ' ')
        #use this doc name to see if this strain exists in our dictionary
        if doc_name in organism_dict:
            #check to see if the current sequence for this strain is shorter than our new sequence 
            if len(record.seq) > organism_dict[doc_name]:
                organism_dict[doc_name] = len(record.seq)
                with open(output_folder + doc_name, 'w') as handle:
                    SeqIO.write(record, handle, 'fasta')
        else:
            organism_dict[doc_name] = len(record.seq)
            with open(output_folder + doc_name, 'w') as handle:
                    SeqIO.write(record, handle, 'fasta')
    except:
        printrecord.id, record.features[0].qualifiers["organism"])

Thank you, Daniel

Python BioPython Genbank • 1.3k views
ADD COMMENT
0
Entering edit mode

Do you know exactly which record it is failing on? Can you slice a small portion of the GenBank file and load it somewhere for testing?

ADD REPLY

Login before adding your answer.

Traffic: 3670 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6