Question: SeqIO.parse reading a 200bp FASTA sequence as 132bp
gravatar for rmartson
2.9 years ago by
rmartson30 wrote:

Here is the relevant part of the function:

flanks = open("flanks" + str(organism) + ".fasta", "w")
for record in flanks: 
    if len(record.seq)<200:
        print, str(len(record.seq))+"bp", "unflanked element"

This returns: NT_086568.2127 132bp unflanked element

However, if I find the relevant ID in the file called by hits I find an undeniably 200bp FASTA sequence in its usual format amongst hundreds of other 200bp FASTA sequences that are never mentioned.

If I alter the script to also return record.seq:

for record in hits: 
    if len(record.seq)<200:
        print, record.seq str(len(record.seq))+"bp", "unflanked element"

And then ctrl+f this sequence in the relevant file I see this:

I think this shows the sequence isn't 200bp.

biopython • 839 views
ADD COMMENTlink modified 2.8 years ago by Peter5.8k • written 2.9 years ago by rmartson30

The first bit of code you posted can't be correct. You're iterating over lines in a file that's empty (you've just opened it for writing). I assume it should have been something like

flanks = SeqIO.parse("flanks{}.fasta".format(organism))
for record in flanks:
ADD REPLYlink written 2.9 years ago by Devon Ryan95k

I cut out the rest. I said the file is full of 200bp sequences when I open it in a text editor so why would I be lying? I just don't want anyone to have to read through the other stuff. I had also edited some of the variable names.

Here's the entire function. The relevant part is "for record in sltrs:" onwards.

def flankfind(organism):
from Bio import SeqIO
flist = list("")
completes = list(SeqIO.parse("complete" + str(organism) + ".fasta", "fasta"))
hits = list(SeqIO.parse("hit" + str(organism) + ".fasta", "fasta"))
flanks = open("flanks" + str(organism) + ".fasta", "w")
fullflanks = open("fullflanks" + str(organism) + ".fasta", "w")

for hit_record1 in hits:
    for hit_record2 in hits:
        for complete_record in completes:
            if == and ==
                start1 = complete_record.seq.find(hit_record1.seq)
                start2 = complete_record.seq.find(hit_record2.seq)              
                if (7000 <  start2-start1 and start2-start1 < 8600):
                    if(start1>100 and (len(complete_record.seq)-start2)>100): 
                        print >> fullflanks, '>' +
                        print >> fullflanks, complete_record.seq[start1-100:start1]+complete_record.seq[start2:start2+100]

for index, hit_record in enumerate(hits):
    for complete_record in completes:
        if == and hit_record.seq not in flist):
            start = complete_record.seq.find(hit_record.seq)
            if (start>100 and (len(complete_record.seq)-(start+len(hit_record.seq)))>100):
                print >> flanks, '>' + + str(index)
                print >> flanks,  complete_record.seq[start-100:start]+complete_record.seq[start+len(hit_record):start+len(hit_record)+100]

sltrs = list(SeqIO.parse("flanks" + str(organism) + ".fasta", "fasta"))
fulls = list(SeqIO.parse("fullflanks" + str(organism) + ".fasta", "fasta"))

print "Total hits:"
print len(hits)
print "LTR elements with flanks:"
print len(sltrs)+len(fulls)
print "Single LTRs with flanks:"
print len(sltrs)
print "Full-length elements with flanks:"
print len(fulls)

for record in sltrs:
    if len(record.seq)<200:
        print, str(len(record.seq))+"bp", "unflanked sLTR"

for record in fulls:
    if len(record.seq)<200:
        print, str(len(record.seq))+"bp", "unflanked full-length element"
ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by rmartson30

I'm now also getting the issue of the function not correctly reading the number of results in my fullflanks(organism).fasta file. If I use len(list(SeqIO.parse([...] outside of the function, I get the correct number (16). If I use it inside the function, I get 0 even though I can open the fasta file and see that it has 16 results in it.

ADD REPLYlink written 2.9 years ago by rmartson30

At the very least, try closing flanks before opening it again, since it's likely that the contents haven't been flushed to disk yet.

ADD REPLYlink written 2.9 years ago by Devon Ryan95k

Okay, I used flanks.close() and fullflanks.close() after the entire flank retrieval part and they ended up being read correctly by SeqIO.parse afterwards. I'm not getting any 200bp sequences being read incorrectly and I get a length of 16 for the fullflanks list.

Thanks a lot!

ADD REPLYlink written 2.9 years ago by rmartson30
gravatar for Peter
2.8 years ago by
Scotland, UK
Peter5.8k wrote:

As per the comments, there was a bug in the user script not explicitly closing the output handles before trying to parse the same file. This can result in data "missing" as it has not yet been flushed to disk.

Python introduced the with statement to help with automating closing handles, since forgetting to close a handle is an easy mistake to make.

ADD COMMENTlink written 2.8 years ago by Peter5.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2289 users visited in the last hour