Question

SeqIO.parse reading a 200bp FASTA sequence as 132bp

0

Entering edit mode

6.8 years ago

rmartson ▴ 30

Here is the relevant part of the function:

flanks = open("flanks" + str(organism) + ".fasta", "w")
#[...]
for record in flanks: 
    if len(record.seq)<200:
        print record.id, str(len(record.seq))+"bp", "unflanked element"

This returns: NT_086568.2127 132bp unflanked element

However, if I find the relevant ID in the file called by hits I find an undeniably 200bp FASTA sequence in its usual format amongst hundreds of other 200bp FASTA sequences that are never mentioned.

If I alter the script to also return record.seq:

for record in hits: 
    if len(record.seq)<200:
        print record.id, record.seq str(len(record.seq))+"bp", "unflanked element"

And then ctrl+f this sequence in the relevant file I see this: http://i.imgur.com/eIE1Yqb.png

I think this shows the sequence isn't 200bp.

Biopython • 1.9k views

ADD COMMENT • link updated 6.7 years ago by Peter 6.0k • written 6.8 years ago by rmartson ▴ 30

0

Entering edit mode

The first bit of code you posted can't be correct. You're iterating over lines in a file that's empty (you've just opened it for writing). I assume it should have been something like

flanks = SeqIO.parse("flanks{}.fasta".format(organism))
for record in flanks:
    ...

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

0

Entering edit mode

I cut out the rest. I said the file is full of 200bp sequences when I open it in a text editor so why would I be lying? I just don't want anyone to have to read through the other stuff. I had also edited some of the variable names.

Here's the entire function. The relevant part is "for record in sltrs:" onwards.

def flankfind(organism):
from Bio import SeqIO
flist = list("")
completes = list(SeqIO.parse("complete" + str(organism) + ".fasta", "fasta"))
hits = list(SeqIO.parse("hit" + str(organism) + ".fasta", "fasta"))
flanks = open("flanks" + str(organism) + ".fasta", "w")
fullflanks = open("fullflanks" + str(organism) + ".fasta", "w")

for hit_record1 in hits:
    for hit_record2 in hits:
        for complete_record in completes:
            if hit_record1.id == complete_record.id and hit_record2.id == complete_record.id):
                start1 = complete_record.seq.find(hit_record1.seq)
                start2 = complete_record.seq.find(hit_record2.seq)              
                if (7000 <  start2-start1 and start2-start1 < 8600):
                    flist.append(hit_record1.seq)
                    flist.append(hit_record2.seq)
                    if(start1>100 and (len(complete_record.seq)-start2)>100): 
                        print >> fullflanks, '>' + strcomplete_record.id)
                        print >> fullflanks, complete_record.seq[start1-100:start1]+complete_record.seq[start2:start2+100]



for index, hit_record in enumerate(hits):
    for complete_record in completes:
        if hit_record.id == complete_record.id and hit_record.seq not in flist):
            start = complete_record.seq.find(hit_record.seq)
            if (start>100 and (len(complete_record.seq)-(start+len(hit_record.seq)))>100):
                print >> flanks, '>' + strhit_record.id) + str(index)
                print >> flanks,  complete_record.seq[start-100:start]+complete_record.seq[start+len(hit_record):start+len(hit_record)+100]

sltrs = list(SeqIO.parse("flanks" + str(organism) + ".fasta", "fasta"))
fulls = list(SeqIO.parse("fullflanks" + str(organism) + ".fasta", "fasta"))

print "Total hits:"
print len(hits)
print "LTR elements with flanks:"
print len(sltrs)+len(fulls)
print "Single LTRs with flanks:"
print len(sltrs)
print "Full-length elements with flanks:"
print len(fulls)

for record in sltrs:
    if len(record.seq)<200:
        print record.id, str(len(record.seq))+"bp", "unflanked sLTR"

for record in fulls:
    if len(record.seq)<200:
        print record.id, str(len(record.seq))+"bp", "unflanked full-length element"

ADD REPLY • link 6.8 years ago by rmartson ▴ 30

0

Entering edit mode

I'm now also getting the issue of the function not correctly reading the number of results in my fullflanks(organism).fasta file. If I use len(list(SeqIO.parse([...] outside of the function, I get the correct number (16). If I use it inside the function, I get 0 even though I can open the fasta file and see that it has 16 results in it.

ADD REPLY • link 6.8 years ago by rmartson ▴ 30

0

Entering edit mode

At the very least, try closing flanks before opening it again, since it's likely that the contents haven't been flushed to disk yet.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

1

Entering edit mode

Okay, I used flanks.close() and fullflanks.close() after the entire flank retrieval part and they ended up being read correctly by SeqIO.parse afterwards. I'm not getting any 200bp sequences being read incorrectly and I get a length of 16 for the fullflanks list.

Thanks a lot!

ADD REPLY • link 6.8 years ago by rmartson ▴ 30

score 1 · Answer 1 · 2017-08-01

As per the comments, there was a bug in the user script not explicitly closing the output handles before trying to parse the same file. This can result in data "missing" as it has not yet been flushed to disk.

Python introduced the with statement to help with automating closing handles, since forgetting to close a handle is an easy mistake to make.