So I need to use a for loop to search through the genome for a specific chromosome, then extract sequences from that chromosome. however that only allows one run. I usually load fastas into memory using list(), but these things are around 3GB each so I can't do that. There's also nothing like seek() for a SeqIO object, so how do I return to the start of the genome for each iteration? Or alternatively, is there a much more efficient way of doing this?
Here are the relevant parts of my code (there are some bits cut out in between):
#sid is defined as the id of the chromosome a hit is found in in a "tabular with comments" BLAST output file #start and end are the s.start and s.end of the hit within the chromosome #genomehandle1 and organism1 are defined by input at the start of the script genome1 = SeqIO.parse(str(genomehandle1)+".fa", "fasta") flanks1 = open("flanks"+str(organism1)+".fa", "w") #[...] for chromosome in genome1: if sid == chromosome.id: print >> flanks1, ">"+str(qid) print >> flanks1, str(chromosome.seq[start-100:start])+str(chromosome.seq[end:end+100]) break #the for loop is reinitiated by another for loop above it
The "break" should just prevent the for loop I've pasted here from running beyond the relevant chromosome, right?
Currently I am getting output into flanks1, but only a single fasta sequence at the expected length when I'm expecting hundreds so I think it's just not finding any more sid == chromosome.id instances because it initiates the search from the end of the genome fasta.