Question: Passing multiple fasta files to local blast
2.5 years ago, New2programming (0) wrote:

I'm looking to run a number of FASTA files simultaneously through local BLAST as part of a pipeline. I'm using Biopython to read my input file and count its sequences; if the file holds more than a specified number (e.g. 1000), I split it into batches of 1000. I'm now looking for a way to run all of the batch files through local BLAST automatically, rather than one at a time, and then concatenate the output files I receive for post-BLAST parsing by E-value.

def batch_iterator(iterator, batch_size):
    """Generator that splits a large record iterator into lists of batch_size."""
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

from Bio import SeqIO

# Count the sequences in the input file first
counter = sum(1 for _ in SeqIO.parse(Input_file, "fasta"))

if counter > 10:  # if the input file has more than 10 seqs, it is batched
    record_iter = SeqIO.parse(Input_file, "fasta")  # fresh iterator for batching
    for i, batch in enumerate(batch_iterator(record_iter, 10)):
        filename = "batch_%i.fasta" % (i + 1)
        with open(filename, "w") as handle:
            count = SeqIO.write(batch, handle, "fasta")
        print("Wrote %i records to %s" % (count, filename))

What would be the best way to automate this so that I grab all my batch files and run them through local BLAST? Would I have to call ./blastp -db a_database -query queryfile.fasta -out blastoutput.tsv -outfmt 6 for each individual file name using os.system(command) in my script, or is there a simpler way?

written 2.5 years ago by New2programming (0)

Is there any specific reason you want to batch this analysis (e.g. to run it in parallel on a compute cluster)? Otherwise it will be more efficient to run BLAST with one big input file.

written 2.5 years ago by lieven.sterck (8.7k)

I found this, in case it helps:

written 2.5 years ago by Bastien Hervé (4.8k)

That link is actually where I started my query; I'm just wondering if there's a way to do it in Python rather than bash.

written 2.5 years ago by New2programming (0)

From 2008 (could be out of date):

written 2.5 years ago by Bastien Hervé (4.8k)

You could use Python's multiprocessing module.
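
A minimal sketch of that route, assuming blastp is on the PATH and the batch files are named batch_*.fasta as in the question. The helper names (blast_command, run_blast, concatenate), the merged file name all_batches.tsv and the choice of 4 workers are all made up for illustration; a_database is the placeholder database name from the question.

```python
import glob
import subprocess
from multiprocessing import Pool


def blast_command(query, db="a_database"):
    """Build the blastp argument list for one batch file (output: <name>.tsv)."""
    out = query.rsplit(".", 1)[0] + ".tsv"
    return ["blastp", "-db", db, "-query", query, "-out", out, "-outfmt", "6"]


def run_blast(query):
    """Run blastp on one batch file and return the path of its tabular output."""
    cmd = blast_command(query)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if blastp fails
    return cmd[cmd.index("-out") + 1]


def concatenate(paths, merged="all_batches.tsv"):
    """Append every per-batch output file to one merged file."""
    with open(merged, "w") as out_handle:
        for path in paths:
            with open(path) as in_handle:
                out_handle.write(in_handle.read())
    return merged


def main():
    batch_files = sorted(glob.glob("batch_*.fasta"))
    if not batch_files:
        return  # nothing to BLAST
    with Pool(processes=4) as pool:  # 4 workers is an arbitrary choice
        outputs = pool.map(run_blast, batch_files)
    concatenate(outputs)


if __name__ == "__main__":
    main()
```

Each worker only waits on an external blastp process, so the pool is just a convenient way to keep several of them running at once. Because -outfmt 6 output has no header line, plain concatenation should leave a file that can still be filtered by E-value afterwards.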

Alternatively, use subprocess to pass the BLAST commands to GNU Parallel on the command line. How you batch up the files before invoking either of these would be entirely up to you in the Python script.
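
A rough sketch of the GNU Parallel variant, assuming both parallel and blastp are installed and on the PATH. The function name parallel_command and the jobs=4 default are hypothetical; {} and {.} are GNU Parallel's replacement strings for the input file and the input file without its extension.

```python
import glob
import shutil
import subprocess


def parallel_command(batch_files, db="a_database", jobs=4):
    """Build a GNU Parallel command line that runs blastp once per batch file."""
    return [
        "parallel", "-j", str(jobs),
        "blastp", "-db", db, "-query", "{}", "-out", "{.}.tsv", "-outfmt", "6",
        ":::", *batch_files,
    ]


def main():
    batch_files = sorted(glob.glob("batch_*.fasta"))
    if not batch_files or shutil.which("parallel") is None:
        return  # nothing to do, or GNU Parallel is not installed
    subprocess.run(parallel_command(batch_files), check=True)


if __name__ == "__main__":
    main()
```

With this layout, batch_1.fasta would produce batch_1.tsv, and so on; the per-batch outputs can then be concatenated in Python before parsing.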

written 2.5 years ago by Joe (18k)