I am trying to write a script that goes through a FASTA file and breaks any sequence larger than a certain limit into smaller pieces. The approach is to read in the file, slice each sequence record that exceeds the limit, and write the pieces back out. To keep the names unique, I append "part1", etc. to the id before writing. In the output, however, the name of each sliced sequence appears twice, with only the first copy carrying the part notation. Does anyone know why this is happening, or if there is a way to prevent it?
This is the script:
from Bio import SeqIO
import sys

inhandle = open(sys.argv[1], 'r')
outhandle = open(sys.argv[2], 'a')  # 'a' appends, so rerunning accumulates output

for record in SeqIO.parse(inhandle, 'fasta'):
    if len(record) <= 100:
        SeqIO.write(record, outhandle, 'fasta')
    else:
        # Split into roughly equal pieces, each at most 100 bases.
        num_pieces = (len(record) // 100) + 1
        for i in range(num_pieces):
            if i == 0:
                part_record = record[:len(record) // num_pieces]
            elif i == num_pieces - 1:
                # Take everything that is left, so no bases are dropped
                # when the length isn't an exact multiple of num_pieces.
                part_record = record[i * len(record) // num_pieces:]
            else:
                part_record = record[i * len(record) // num_pieces:
                                     (i + 1) * len(record) // num_pieces]
            part_record.id += '_part_%i' % (i + 1)
            SeqIO.write(part_record, outhandle, 'fasta')

inhandle.close()
outhandle.close()
Here is an example input FASTA:
>SHORT_SEQ
TTCGCAAGATTCGATACCACTTTTTAGACAGCTATTCGTCCGGCAATCTATTCGCTATGA
>LONG_SEQ
GGCGGGTAGCAACGATTCATATCTGACGCTCAACTTCCACTCCGGGGCACATACGCTAGA
CTTAAAAACGGGTTTTGCATGGAAGGGACAAGCCGTGCGGTCCGCCCTGCAACCCCAACC
GGAGGGCGTGGAGGGTCCCTACACGAACGATCTTCGTTGGCGCCTATGGCAGCTATACGC
GGTGCCTGCTTATCGATTCGTGTATGGGGCTTGGAGCCTCACGTCGTCCGCGATGCTGGA
This gets output as:
>SHORT_SEQ
TTCGCAAGATTCGATACCACTTTTTAGACAGCTATTCGTCCGGCAATCTATTCGCTATGA
>LONG_SEQ_part_1 LONG_SEQ
GGCGGGTAGCAACGATTCATATCTGACGCTCAACTTCCACTCCGGGGCACATACGCTAGA
CTTAAAAACGGGTTTTGCAT
>LONG_SEQ_part_2 LONG_SEQ
GGAAGGGACAAGCCGTGCGGTCCGCCCTGCAACCCCAACCGGAGGGCGTGGAGGGTCCCT
ACACGAACGATCTTCGTTGG
>LONG_SEQ_part_3 LONG_SEQ
CGCCTATGGCAGCTATACGCGGTGCCTGCTTATCGATTCGTGTATGGGGCTTGGAGCCTC
ACGTCGTCCGCGATGCTGGA
I could pipe it through sed to get rid of the part after the space, but I am curious about what exactly is going on, if anyone can enlighten me.
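Update: the cause seems to be that SeqIO.parse stores the whole FASTA title line in record.description (with the first word also in record.id), and the FASTA writer only suppresses the description when its first word still equals the id. Once "_part_1" is appended to the id, the two diverge and both get printed. A minimal sketch of the behaviour (the record contents are made up for illustration):

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import sys

# After SeqIO.parse, id and description both begin as 'LONG_SEQ'.
rec = SeqRecord(Seq('ACGT'), id='LONG_SEQ', description='LONG_SEQ')
rec.id += '_part_1'
# The description's first word no longer matches the id, so both are written:
SeqIO.write(rec, sys.stdout, 'fasta')  # >LONG_SEQ_part_1 LONG_SEQ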
As suggested by Peter, here is a generator function. You can feed it the result of SeqIO.parse and call SeqIO.write on the result.
def chop_fasta(file_iter, maxsize):
    for record in file_iter:
        if len(record) <= maxsize:
            yield record
        else:
            num_pieces = (len(record) // maxsize) + 1
            for i in range(num_pieces):
                if i == 0:
                    part_record = record[:len(record) // num_pieces]
                elif i == num_pieces - 1:
                    # Take everything that is left, so no bases are dropped
                    # when the length isn't an exact multiple of num_pieces.
                    part_record = record[i * len(record) // num_pieces:]
                else:
                    part_record = record[i * len(record) // num_pieces:
                                         (i + 1) * len(record) // num_pieces]
                part_record.id += '_part_%i' % (i + 1)
                # Keep the description in sync with the id; otherwise the
                # FASTA writer prints both and the name appears twice.
                part_record.description = part_record.id
                yield part_record
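Usage looks something like this (the file names are just placeholders):

from Bio import SeqIO

with open('input.fasta') as inhandle, open('output.fasta', 'w') as outhandle:
    count = SeqIO.write(chop_fasta(SeqIO.parse(inhandle, 'fasta'), 100),
                        outhandle, 'fasta')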
It should be faster to make a single call to SeqIO.write. In this case I would use a generator function based on your current code, yielding each record of interest.
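Something like this skeleton (the names here are illustrative, not from your script):

from Bio import SeqIO

def records_of_interest(records):
    for record in records:
        # ...filter or modify each record as needed...
        yield record

count = SeqIO.write(records_of_interest(SeqIO.parse('in.fasta', 'fasta')),
                    'out.fasta', 'fasta')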
Thanks! I need to think more about generator functions, especially when dealing with large files. I have appended some code to the question.