Question: SeqIO.parse reading a 200bp FASTA sequence as 132bp
0
gravatar for rmartson
22 months ago by
rmartson20
rmartson20 wrote:

Here is the relevant part of the function:

flanks = open("flanks" + str(organism) + ".fasta", "w")
#[...]
for record in flanks: 
    if len(record.seq)<200:
        print record.id, str(len(record.seq))+"bp", "unflanked element"

This returns: NT_086568.2127 132bp unflanked element

However, if I find the relevant ID in the file called by hits I find an undeniably 200bp FASTA sequence in its usual format amongst hundreds of other 200bp FASTA sequences that are never mentioned.

If I alter the script to also return record.seq:

for record in hits: 
    if len(record.seq)<200:
        print record.id, record.seq str(len(record.seq))+"bp", "unflanked element"

And then ctrl+f this sequence in the relevant file I see this: http://i.imgur.com/eIE1Yqb.png

I think this shows the sequence isn't 200bp.

biopython • 592 views
ADD COMMENTlink modified 21 months ago by Peter5.8k • written 22 months ago by rmartson20

The first bit of code you posted can't be correct. You're iterating over lines in a file that's empty (you've just opened it for writing). I assume it should have been something like

flanks = SeqIO.parse("flanks{}.fasta".format(organism))
for record in flanks:
    ...
ADD REPLYlink written 22 months ago by Devon Ryan90k

I cut out the rest. I said the file is full of 200bp sequences when I open it in a text editor so why would I be lying? I just don't want anyone to have to read through the other stuff. I had also edited some of the variable names.

Here's the entire function. The relevant part is "for record in sltrs:" onwards.

def flankfind(organism):
from Bio import SeqIO
flist = list("")
completes = list(SeqIO.parse("complete" + str(organism) + ".fasta", "fasta"))
hits = list(SeqIO.parse("hit" + str(organism) + ".fasta", "fasta"))
flanks = open("flanks" + str(organism) + ".fasta", "w")
fullflanks = open("fullflanks" + str(organism) + ".fasta", "w")

for hit_record1 in hits:
    for hit_record2 in hits:
        for complete_record in completes:
            if hit_record1.id == complete_record.id and hit_record2.id == complete_record.id):
                start1 = complete_record.seq.find(hit_record1.seq)
                start2 = complete_record.seq.find(hit_record2.seq)              
                if (7000 <  start2-start1 and start2-start1 < 8600):
                    flist.append(hit_record1.seq)
                    flist.append(hit_record2.seq)
                    if(start1>100 and (len(complete_record.seq)-start2)>100): 
                        print >> fullflanks, '>' + strcomplete_record.id)
                        print >> fullflanks, complete_record.seq[start1-100:start1]+complete_record.seq[start2:start2+100]



for index, hit_record in enumerate(hits):
    for complete_record in completes:
        if hit_record.id == complete_record.id and hit_record.seq not in flist):
            start = complete_record.seq.find(hit_record.seq)
            if (start>100 and (len(complete_record.seq)-(start+len(hit_record.seq)))>100):
                print >> flanks, '>' + strhit_record.id) + str(index)
                print >> flanks,  complete_record.seq[start-100:start]+complete_record.seq[start+len(hit_record):start+len(hit_record)+100]

sltrs = list(SeqIO.parse("flanks" + str(organism) + ".fasta", "fasta"))
fulls = list(SeqIO.parse("fullflanks" + str(organism) + ".fasta", "fasta"))

print "Total hits:"
print len(hits)
print "LTR elements with flanks:"
print len(sltrs)+len(fulls)
print "Single LTRs with flanks:"
print len(sltrs)
print "Full-length elements with flanks:"
print len(fulls)

for record in sltrs:
    if len(record.seq)<200:
        print record.id, str(len(record.seq))+"bp", "unflanked sLTR"

for record in fulls:
    if len(record.seq)<200:
        print record.id, str(len(record.seq))+"bp", "unflanked full-length element"
ADD REPLYlink modified 22 months ago • written 22 months ago by rmartson20

I'm now also getting the issue of the function not correctly reading the number of results in my fullflanks(organism).fasta file. If I use len(list(SeqIO.parse([...] outside of the function, I get the correct number (16). If I use it inside the function, I get 0 even though I can open the fasta file and see that it has 16 results in it.

ADD REPLYlink written 22 months ago by rmartson20

At the very least, try closing flanks before opening it again, since it's likely that the contents haven't been flushed to disk yet.

ADD REPLYlink written 22 months ago by Devon Ryan90k
1

Okay, I used flanks.close() and fullflanks.close() after the entire flank retrieval part and they ended up being read correctly by SeqIO.parse afterwards. I'm not getting any 200bp sequences being read incorrectly and I get a length of 16 for the fullflanks list.

Thanks a lot!

ADD REPLYlink written 22 months ago by rmartson20
1
gravatar for Peter
21 months ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

As per the comments, there was a bug in the user script not explicitly closing the output handles before trying to parse the same file. This can result in data "missing" as it has not yet been flushed to disk.

Python introduced the with statement to help with automating closing handles, since forgetting to close a handle is an easy mistake to make.

ADD COMMENTlink written 21 months ago by Peter5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1104 users visited in the last hour