I am trying to write a script that goes through a FASTA file and breaks any sequence larger than a certain limit into smaller pieces. The approach is to read in the file, slice each sequence record that exceeds the limit, and write the pieces back out. To keep the names unique, I append "part1", etc. to the id before writing. In the output, however, the name of each sliced sequence appears twice, with only the first copy carrying the part notation. Does anyone know why this is happening, or if there is a way to prevent it?
This is the script:
from Bio import SeqIO
import sys

inhandle = open(sys.argv[1], 'r')
outhandle = open(sys.argv[2], 'a')  # 'a' appends, so rerunning accumulates output

for record in SeqIO.parse(inhandle, 'fasta'):
    if len(record) <= 100:
        SeqIO.write(record, outhandle, 'fasta')
    else:
        # Split into roughly equal pieces, each at most 100 bases.
        num_pieces = (len(record) // 100) + 1
        for i in range(num_pieces):
            if i == 0:
                part_record = record[:len(record) // num_pieces]
            elif i == num_pieces - 1:
                # Take everything that is left, so no bases are dropped
                # when the length isn't an exact multiple of num_pieces.
                part_record = record[i * len(record) // num_pieces:]
            else:
                part_record = record[i * len(record) // num_pieces:
                                     (i + 1) * len(record) // num_pieces]
            part_record.id += '_part_%i' % (i + 1)
            SeqIO.write(part_record, outhandle, 'fasta')

inhandle.close()
outhandle.close()
Here is an example input FASTA:
>SHORT_SEQ
TTCGCAAGATTCGATACCACTTTTTAGACAGCTATTCGTCCGGCAATCTATTCGCTATGA
>LONG_SEQ
GGCGGGTAGCAACGATTCATATCTGACGCTCAACTTCCACTCCGGGGCACATACGCTAGA
CTTAAAAACGGGTTTTGCATGGAAGGGACAAGCCGTGCGGTCCGCCCTGCAACCCCAACC
GGAGGGCGTGGAGGGTCCCTACACGAACGATCTTCGTTGGCGCCTATGGCAGCTATACGC
GGTGCCTGCTTATCGATTCGTGTATGGGGCTTGGAGCCTCACGTCGTCCGCGATGCTGGA
This gets output as:
>SHORT_SEQ
TTCGCAAGATTCGATACCACTTTTTAGACAGCTATTCGTCCGGCAATCTATTCGCTATGA
>LONG_SEQ_part_1 LONG_SEQ
GGCGGGTAGCAACGATTCATATCTGACGCTCAACTTCCACTCCGGGGCACATACGCTAGA
CTTAAAAACGGGTTTTGCAT
>LONG_SEQ_part_2 LONG_SEQ
GGAAGGGACAAGCCGTGCGGTCCGCCCTGCAACCCCAACCGGAGGGCGTGGAGGGTCCCT
ACACGAACGATCTTCGTTGG
>LONG_SEQ_part_3 LONG_SEQ
CGCCTATGGCAGCTATACGCGGTGCCTGCTTATCGATTCGTGTATGGGGCTTGGAGCCTC
ACGTCGTCCGCGATGCTGGA
I could pipe it through sed to get rid of the part after the space, but I am curious about what exactly is going on, if anyone can enlighten me.
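Update: the cause seems to be that SeqIO.parse stores the whole FASTA title line in record.description (with the first word also in record.id), and the FASTA writer only suppresses the description when its first word still equals the id. Once "_part_1" is appended to the id, the two diverge and both get printed. A minimal sketch of the behaviour (the record contents are made up for illustration):

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import sys

# After SeqIO.parse, id and description both begin as 'LONG_SEQ'.
rec = SeqRecord(Seq('ACGT'), id='LONG_SEQ', description='LONG_SEQ')
rec.id += '_part_1'
# The description's first word no longer matches the id, so both are written:
SeqIO.write(rec, sys.stdout, 'fasta')  # >LONG_SEQ_part_1 LONG_SEQ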
As suggested by Peter, here is a generator function. You can feed it the result of SeqIO.parse and call SeqIO.write on the result.
def chop_fasta(file_iter, maxsize):
    for record in file_iter:
        if len(record) <= maxsize:
            yield record
        else:
            num_pieces = (len(record) // maxsize) + 1
            for i in range(num_pieces):
                if i == 0:
                    part_record = record[:len(record) // num_pieces]
                elif i == num_pieces - 1:
                    # Take everything that is left, so no bases are dropped
                    # when the length isn't an exact multiple of num_pieces.
                    part_record = record[i * len(record) // num_pieces:]
                else:
                    part_record = record[i * len(record) // num_pieces:
                                         (i + 1) * len(record) // num_pieces]
                part_record.id += '_part_%i' % (i + 1)
                # Keep the description in sync with the id; otherwise the
                # FASTA writer prints both and the name appears twice.
                part_record.description = part_record.id
                yield part_record
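Usage looks something like this (the file names are just placeholders):

from Bio import SeqIO

with open('input.fasta') as inhandle, open('output.fasta', 'w') as outhandle:
    count = SeqIO.write(chop_fasta(SeqIO.parse(inhandle, 'fasta'), 100),
                        outhandle, 'fasta')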
It should be faster to make a single call to SeqIO.write. In this case I would use a generator function based on your current code, yielding each record of interest.
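Something like this skeleton (the names here are illustrative, not from your script):

from Bio import SeqIO

def records_of_interest(records):
    for record in records:
        # ...filter or modify each record as needed...
        yield record

count = SeqIO.write(records_of_interest(SeqIO.parse('in.fasta', 'fasta')),
                    'out.fasta', 'fasta')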
Thanks! I need to think more about generator functions, especially when dealing with large files. I have appended some code to the question.