Hi,
I want to parse a big fasta file (253721 sequences) in order to extract one sequence per species present in my list (which consists of 86859 different species).
I wrote some code that works using Biopython's SeqIO and a regexp, but with this many sequences and species it would take days to run...
I started looking for a way to parallelize the work using multiprocessing in Python; however, I can't really connect the examples (for example here) with my code...
Here is my non-parallelized code:
#!/usr/bin/env python
from Bio import SeqIO

def read_names(file_to_open):
    # Load the species list and deduplicate it into a set.
    name = []
    with open(file_to_open, "r") as names:
        for i in names:
            name.append(i.rstrip('\n'))
    return set(name)

names = read_names("specie_to_ncbi.txt")
filename = "all_its.fa"
lowsize = 300
highsize = 1000
seqs = []
seq_ids = []

# First pass: count the records so progress can be reported.
j = 0
for seq_record in SeqIO.parse(filename, "fasta"):
    j += 1
print "there are " + str(j) + " sequences in the fasta file"

# Second pass: keep one length-filtered sequence per species.
i = 0
for seq_record in SeqIO.parse(filename, "fasta"):
    i += 1
    # float() avoids Python 2 integer division (i/j would stay 0 until the end)
    numb = str(round((float(i) / j) * 100, 2))
    print numb + " % of the sequences already computed"
    longueur = len(seq_record.seq)
    if lowsize < longueur < highsize:
        ident = seq_record.description
        for specie in names:
            if specie in str(ident):
                seq_ids.append(str(seq_record.id) + "|" + str(specie))
                seqs.append(str(seq_record.seq))
                names.remove(specie)  # each species is extracted only once
                break

# Write the selected sequences out as FASTA.
ITS = open("very_simple_db_ITS.fa", "w")
for i, s in zip(seq_ids, seqs):
    ITS.write(">" + i + "\n")
    ITS.write(s + "\n")
ITS.close()
And here is my failed attempt to parallelize it using multiprocessing:
#!/usr/bin/env python
from Bio import Entrez
from Bio import SeqIO
from multiprocessing import Pool
import subprocess
import itertools
import re

def read_names(file_to_open):
    name = []
    names = open(file_to_open, "r")
    for i in names:
        name.append(i.rstrip('\n'))
    name = set(name)
    names.close()
    return name

names = read_names("head_specie_to_ncbi.txt")
filename = "head_all_its.fa"
lowsize = 300
highsize = 1000

def calc(seq_record, specie_names):
    dictio = {}
    filename = "all_its.fa"
    longueur = len(seq_record.seq)
    if longueur > lowsize and longueur < highsize:
        ident = seq_record.description
        for specie in specie_names:
            if specie in str(ident):
                myseq = str(seq_record.seq)
                seq_ids.append(str(seq_record.id) + "|" + str(specie))
                dictio[str(ident)] = myseq
                break
    return dictio

if __name__ == '__main__':
    pool = Pool()
    mydict = pool.map(calc(SeqIO.parse(filename, "fasta"), names))
    print str(mydict)
Edit: I reduced the computing time to 20 minutes by replacing the regex with Python's "word in string" membership test, but I'm still wondering about the multiprocessing...
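For reference, a minimal sketch of that substitution on a made-up description string (the post doesn't show the original regex, so the pattern here is an assumption):

    import re

    ident = "gi|12345|Homo sapiens ITS region"  # hypothetical FASTA description
    specie = "Homo sapiens"

    # Before: a regex search, compiled and scanned for every (record, species) pair.
    found = re.search(specie, ident) is not None

    # After: Python's plain substring test, a single fast C-level call.
    found = specie in ident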
You might mention why it failed and with what exception. From what I can see, you are passing the result of a function call to the map method, whereas map expects an uncalled function and a sequence over which to map that function.
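For illustration, a minimal sketch of the calling convention Pool.map expects: the worker is passed uncalled, and the iterable supplies one argument per call (the function and data here are made up, not the asker's pipeline):

    from multiprocessing import Pool

    def process_record(record):
        # Single-argument worker: Pool.map invokes it once per item
        # of the iterable, spreading the calls across processes.
        return record.upper()

    if __name__ == '__main__':
        records = ["acgt", "ttga", "ccat"]
        pool = Pool()
        # Pass the function itself (no parentheses) plus the iterable;
        # map returns the results as a list, in input order.
        results = pool.map(process_record, records)
        pool.close()
        pool.join()
        print(results)  # ['ACGT', 'TTGA', 'CCAT']

Since calc takes two arguments, one option is functools.partial(calc, specie_names=names) to bind the species set before handing the function to map. Note, though, that each worker process gets its own copy of names, so a species removed in one worker is still visible to the others.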
Also be aware that 'bee' in 'beetle' will return True, since the string __contains__ method does substring matching; you are probably interested in word boundaries.
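One possible compromise is anchoring the species name at word boundaries with an escaped pattern (a sketch only; the helper name is made up):

    import re

    def matches_species(specie, description):
        # \b anchors the match at word boundaries, so 'bee' no longer
        # matches inside 'beetle'; re.escape() protects against regex
        # metacharacters that may appear in the species name.
        pattern = r'\b' + re.escape(specie) + r'\b'
        return re.search(pattern, description) is not None

    print(matches_species('bee', 'beetle'))     # False
    print(matches_species('bee', 'honey bee'))  # True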