Question

Need to extract sequences from genome fasta (SeqIO object) into output file using for loop. How do I return to the start of the genome with every iteration?

0

Entering edit mode

6.8 years ago

rmartson • 0

So I need to use a for loop to search through the genome for a specific chromosome, then extract sequences from that chromosome. however that only allows one run. I usually load fastas into memory using list(), but these things are around 3GB each so I can't do that. There's also nothing like seek() for a SeqIO object, so how do I return to the start of the genome for each iteration? Or alternatively, is there a much more efficient way of doing this?

Here are the relevant parts of my code (there are some bits cut out in between):

#sid is defined as the id of the chromosome a hit is found in in a "tabular with comments" BLAST output file
#start and end are the s.start and s.end of the hit within the chromosome
#genomehandle1 and organism1 are defined by input at the start of the script

genome1 = SeqIO.parse(str(genomehandle1)+".fa", "fasta")
flanks1 = open("flanks"+str(organism1)+".fa", "w")
#[...]
for chromosome in genome1:
    if sid == chromosome.id:
        print >> flanks1, ">"+str(qid)
        print >> flanks1, str(chromosome.seq[start-100:start])+str(chromosome.seq[end:end+100])
        break
#the for loop is reinitiated by another for loop above it

The "break" should just prevent the for loop I've pasted here from running beyond the relevant chromosome, right?

Currently I am getting output into flanks1, but only a single fasta sequence at the expected length when I'm expecting hundreds so I think it's just not finding any more sid == chromosome.id instances because it initiates the search from the end of the genome fasta.

Biopython blast • 2.2k views

ADD COMMENT • link updated 6.8 years ago by James Ashmore ★ 3.4k • written 6.8 years ago by rmartson • 0

score 0 · Answer 1 · 2017-07-19

0

Entering edit mode

6.8 years ago

James Ashmore ★ 3.4k

Make yourself a BED file containing the regions you want to extract and then try the bedtools getfasta command:

bedtools getfasta [OPTIONS] -fi <FASTA> -bed <BED>

ADD COMMENT • link 6.8 years ago by James Ashmore ★ 3.4k

0

Entering edit mode

eh I just loaded them to memory and it was fine

thanks though

ADD REPLY • link 6.8 years ago by rmartson • 0