Question: MultiProcessing on SeqIO biopython objects
Maxime B wrote:

Hi,
I want to parse a big fasta file (253,721 sequences) in order to extract one sequence per species present in my list (which contains 86,859 different species).

I wrote some code using Biopython's SeqIO and regexes which works great, but given the number of sequences and species, it would take days to run.

I started to look for some way of parallelizing the work using multiprocessing in Python; however, I can't really connect the examples (for example, here) with my code...

Here is my non-parallelized code:


#!/usr/bin/env python
from Bio import Entrez
from Bio import SeqIO
import subprocess
import itertools
import re

def read_names(file_to_open):
    # Return the set of species names, one per line of the file.
    names_file = open(file_to_open, "r")
    names = set(line.rstrip('\n') for line in names_file)
    names_file.close()
    return names

names = read_names("specie_to_ncbi.txt")
filename = "all_its.fa"
lowsize = 300
highsize = 1000

seqs = []
seq_ids = []

# First pass: count the records so progress can be reported.
j = 0
for seq_record in SeqIO.parse(filename, "fasta"):
    j += 1
print "there are " + str(j) + " sequences in the fasta file"

# Second pass: keep the first suitably sized sequence per species.
i = 0
for seq_record in SeqIO.parse(filename, "fasta"):
    i += 1
    numb = str(round(100.0 * i / j, 2))  # float division; plain i/j is always 0 in Python 2
    print numb + " % of the sequences already computed"
    longueur = len(seq_record.seq)
    if lowsize < longueur < highsize:
        ident = seq_record.description
        for specie in names:
            if specie in ident:
                seq_ids.append(str(seq_record.id) + "|" + specie)
                seqs.append(str(seq_record.seq))
                names.remove(specie)  # keep only one sequence per species
                break
ITS = open("very_simple_db_ITS.fa", "w")
for i, s in zip(seq_ids, seqs):
    ITS.write(">" + i + "\n")
    ITS.write(s + "\n")
ITS.close()
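As an aside, Biopython can also write the output FASTA for you instead of formatting the strings by hand. A small sketch reusing the seq_ids and seqs lists built above (note that SeqIO's FASTA writer wraps sequence lines at 60 characters, unlike the one-line-per-sequence output above):

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Build SeqRecord objects from the collected ids and sequences, then
# let Biopython handle the FASTA formatting.
records = [SeqRecord(Seq(s), id=i, description="")
           for i, s in zip(seq_ids, seqs)]
SeqIO.write(records, "very_simple_db_ITS.fa", "fasta")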
   

And here is my failed attempt to parallelize it using multiprocessing:


#!/usr/bin/env python
from Bio import Entrez
from Bio import SeqIO
from multiprocessing import Pool
import subprocess
import itertools
import re

def read_names(file_to_open):
    # Return the set of species names, one per line of the file.
    names_file = open(file_to_open, "r")
    names = set(line.rstrip('\n') for line in names_file)
    names_file.close()
    return names

names = read_names("head_specie_to_ncbi.txt")
filename = "head_all_its.fa"
lowsize = 300
highsize = 1000

def calc(seq_record, specie_names):
    dictio = {}

    filename = "all_its.fa"

    longueur = len(seq_record.seq)
    if lowsize < longueur < highsize:
        ident = seq_record.description
        for specie in specie_names:
            if specie in ident:
                myseq = str(seq_record.seq)
                seq_ids.append(str(seq_record.id) + "|" + specie)
                dictio[ident] = myseq
                break
    return dictio

if __name__ == '__main__':
    pool = Pool()
    mydict = pool.map(calc(SeqIO.parse(filename, "fasta"), names))
    print str(mydict)

 


Edit: I reduced the computing time to 20 minutes by replacing the regex with Python's "word in string" membership test, but I am still wondering about the multiprocessing...
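For reference, a rough sketch of the change the edit describes; the header string and species name are made up for illustration. A fresh regex search per (record, species) pair is replaced by Python's built-in substring operator, which does the same check in C with no pattern compilation:

import re

# Hypothetical FASTA description and species name, for illustration only.
ident = "gi|12345|ref|NR_000001.1| Saccharomyces cerevisiae ITS region"
specie = "Saccharomyces cerevisiae"

# Original approach: compile and run a regex for every (record, species) pair.
found = re.search(re.escape(specie), ident) is not None

# Faster approach: the plain substring test gives the same answer here.
found = specie in ident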

Tags: biopython • pool • multiprocessing

You might mention why it failed and with what exception. From what I can see, you are passing a function call to the map method, which expects an uncalled function and a sequence over which to map that function.

— Matt Shirley
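A minimal sketch of what that fix could look like, assuming SeqRecord objects pickle cleanly (they should for plain FASTA records) and reusing the filenames from the question. The species set is loaded at module level so every worker process can see it:

#!/usr/bin/env python
from multiprocessing import Pool
from Bio import SeqIO

lowsize = 300
highsize = 1000

# Loaded at import time so worker processes see it too.
names = set(line.rstrip('\n') for line in open("specie_to_ncbi.txt"))

def calc(seq_record):
    # Return {description: sequence} for a matching record, else {}.
    dictio = {}
    longueur = len(seq_record.seq)
    if lowsize < longueur < highsize:
        ident = seq_record.description
        for specie in names:
            if specie in ident:
                dictio[ident] = str(seq_record.seq)
                break
    return dictio

if __name__ == '__main__':
    pool = Pool()
    # map expects the function itself plus an iterable of single
    # arguments: one SeqRecord per call.
    results = pool.map(calc, SeqIO.parse("all_its.fa", "fasta"))
    pool.close()
    pool.join()
    # Merge the per-record dictionaries into one.
    mydict = {}
    for d in results:
        mydict.update(d)
    print str(len(mydict)) + " matching sequences found"

One behavioral difference to note: the serial version removes each species from names after its first hit, which cannot work across processes, so this sketch may keep several sequences for the same species; duplicates would need to be filtered after the merge.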

Also be aware that 'bee' in 'beetle' will return True, as the string __contains__ method does substring matching, and you are probably interested in word boundaries.

— Matt Shirley
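A small illustration of that caveat (the description string here is invented): anchoring the species name with \b word boundaries avoids matches inside longer words, at the cost of going back to a regex.

import re

description = "gi|67890 Coleoptera beetle ITS sequence"  # made-up header

print 'bee' in description                       # True: matches inside "beetle"
print bool(re.search(r'\bbee\b', description))   # False: no standalone word "bee"

# For an arbitrary species name, escape it so regex metacharacters stay literal.
specie = "Coleoptera"
pattern = re.compile(r'\b' + re.escape(specie) + r'\b')
print bool(pattern.search(description))          # True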