Question: Biopython Seqio - Why Is It Repeating Deflines?
3
gravatar for Cliff Beall
5.1 years ago by
Cliff Beall440
Ohio
Cliff Beall440 wrote:

I was trying to write a script to go through a fasta file and break the sequences larger than a certain limit into smaller ones. I was trying to do this by reading in the file, slicing the sequence record if it was larger than my limit, then writing it back out. To keep the names unique I was trying to append "part1", etc. onto the id before writing. The result though, was for the sliced sequence the name is now repeated twice, with the first one having the part notation. Does anyone know why this is happening, or if there is a way to prevent it?

This is the script:

from Bio import SeqIO
import sys
inhandle = open(sys.argv[1], 'r')
outhandle = open(sys.argv[2], 'a')
for record in SeqIO.parse(inhandle, 'fasta'):
    if len(record) <= 100:
        SeqIO.write(record, outhandle, 'fasta')
    elif len(record) > 100:
        num_pieces = (len(record)/100) + 1
        for i in range(num_pieces):
            if i == 0:
                part_record = record[:len(record)/num_pieces]
            elif i == num_pieces - 1:
                part_record = record[-len(record)/num_pieces:]
            else:
                part_record = record[i * len(record)/num_pieces:(i + 1) * len(record)/num_pieces]
            part_record.id += '_part_%i' % (i + 1)
            SeqIO.write(part_record, outhandle, 'fasta')
inhandle.close()
outhandle.close()

Here is an example input and output fasta:

>SHORT_SEQ
TTCGCAAGATTCGATACCACTTTTTAGACAGCTATTCGTCCGGCAATCTATTCGCTATGA
>LONG_SEQ
GGCGGGTAGCAACGATTCATATCTGACGCTCAACTTCCACTCCGGGGCACATACGCTAGA
CTTAAAAACGGGTTTTGCATGGAAGGGACAAGCCGTGCGGTCCGCCCTGCAACCCCAACC
GGAGGGCGTGGAGGGTCCCTACACGAACGATCTTCGTTGGCGCCTATGGCAGCTATACGC
GGTGCCTGCTTATCGATTCGTGTATGGGGCTTGGAGCCTCACGTCGTCCGCGATGCTGGA

Gets output as:

>SHORT_SEQ
TTCGCAAGATTCGATACCACTTTTTAGACAGCTATTCGTCCGGCAATCTATTCGCTATGA
>LONG_SEQ_part_1 LONG_SEQ
GGCGGGTAGCAACGATTCATATCTGACGCTCAACTTCCACTCCGGGGCACATACGCTAGA
CTTAAAAACGGGTTTTGCAT
>LONG_SEQ_part_2 LONG_SEQ
GGAAGGGACAAGCCGTGCGGTCCGCCCTGCAACCCCAACCGGAGGGCGTGGAGGGTCCCT
ACACGAACGATCTTCGTTGG
>LONG_SEQ_part_3 LONG_SEQ
CGCCTATGGCAGCTATACGCGGTGCCTGCTTATCGATTCGTGTATGGGGCTTGGAGCCTC
ACGTCGTCCGCGATGCTGGA

I could pipe it through sed to get rid of the part after the space but I am curious about what exactly is going on, if anyone can enlighten me.

As suggested by Peter, here is a generator function. You can feed it the result of SeqIO.parse and call SeqIO.write on the result.

def chop_fasta(file_iter, maxsize):
    while True:
        record = file_iter.next()
        if len(record) <= maxsize:
            yield record
        else:
            num_pieces = (len(record)/maxsize) + 1
            for i in range(num_pieces):
                if i == 0:
                    part_record = record[:len(record)/num_pieces]
                elif i == num_pieces - 1:
                    part_record = record[-len(record)/num_pieces:]
                else:
                    part_record = record[i * len(record)/num_pieces:(i + 1) * len(record)/num_pieces]
                    part_record.id += '_part_%i' % (i + 1)
                    part_record.description = part_record.id
                    yield part_record
fasta biopython • 1.4k views
ADD COMMENTlink modified 5.1 years ago by bow760 • written 5.1 years ago by Cliff Beall440
2

It should be faster to make one call to SeqIO.write, in this case I would use a generator function based on your current code doing a yield with each record of interest.

ADD REPLYlink written 5.1 years ago by Peter5.6k

Thanks, I need to think more about generator functions, especially when dealing with large files. I appended some code to the question.

ADD REPLYlink written 5.1 years ago by Cliff Beall440
8
gravatar for bow
5.1 years ago by
bow760
Netherlands
bow760 wrote:

Hi Cliff,

This seems to be a defined behavior of SeqIO, actually. You can 'fix' the behavior by appending this:

part_record.description = part_record.id

after this:

part_record.id += '_part_%i' % (i + 1)

This is because SeqIO only writes the description (everything after the first space) if it is different from the sequence ID. Initially, your sequence created a SeqRecord object with the same ID and description. But because you modified the ID, you need to change the description to be the same with the new ID to prevent SeqIO.write from appending the old description.

Hope this helps :)

EDIT: If you're curious, the relevant code is here: https://github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py#L137

ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by bow760
4

SeqIO is trying to preserve the (unmodified) description when you output the record as FASTA format. As an alternative to Bow's suggestion, just set the description to an empty string:

part_record.description = ""
ADD REPLYlink written 5.1 years ago by Peter5.6k

Thanks, that does the trick.

ADD REPLYlink written 5.1 years ago by Cliff Beall440
0
gravatar for Damian Kao
5.1 years ago by
Damian Kao14k
USA
Damian Kao14k wrote:

It looks like a bug with how BioPython writes out the record.ids. I usually try not to use their SeqIO.write that much actually.

You can do something like this:

from Bio import SeqIO
import sys
inhandle = open(sys.argv[1], 'r')
outhandle = open(sys.argv[2], 'a')

for record in SeqIO.parse(inhandle, 'fasta'):
    seq = str(record.seq)
    frags = [seq[i:i+100] for i in range(0,len(seq), 100)]
    if len(frags) == 1:
        outhandle.write(">" + record.description + "\n" + frags[0] + "\n")
    else:
        for i in range(len(frags)):
            outhandle.write(">" + record.description + "_part" + str(i+1) + "\n" + frags[i] + "\n")
inhandle.close()
outhandle.close()
ADD COMMENTlink written 5.1 years ago by Damian Kao14k

Not a bug but deliberate so as to preserve the record's description and ID if possible. See Bow's answer.

ADD REPLYlink written 5.1 years ago by Peter5.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 965 users visited in the last hour