MultiProcessing on SeqIO biopython objects
6.9 years ago
Maxime B

Hi,

I want to parse a big FASTA file (253,721 sequences) in order to extract one sequence per species present in my list (which contains 86,859 different species).

I wrote some code that works fine using Biopython's SeqIO and regular expressions, but with this many sequences and species it would take days to run.

I started looking for a way to parallelize the work using multiprocessing in Python, but I can't really connect the examples (for example, here) with my code.

Here is my non-parallelized code:


#!/usr/bin/env python
from Bio import SeqIO

def read_names(file_to_open):
    # One species name per line; a set gives O(1) membership tests
    names = set()
    with open(file_to_open, "r") as handle:
        for line in handle:
            names.add(line.rstrip('\n'))
    return names

names = read_names("specie_to_ncbi.txt")
filename = "all_its.fa"
lowsize = 300
highsize = 1000

seqs = []
seq_ids = []

# First pass: count the sequences so progress can be reported
j = 0
for seq_record in SeqIO.parse(filename, "fasta"):
    j += 1
print "there are " + str(j) + " sequences in the fasta file"

# Second pass: keep the first suitably sized sequence for each species
i = 0
for seq_record in SeqIO.parse(filename, "fasta"):
    i += 1
    # float() avoids Python 2 integer division, which would always give 0.0
    numb = str(round((float(i) / j) * 100, 2))
    print numb + " % of the sequences already computed"
    longueur = len(seq_record.seq)
    if lowsize < longueur < highsize:
        ident = seq_record.description
        for specie in names:
            if specie in str(ident):
                seq_ids.append(str(seq_record.id) + "|" + str(specie))
                seqs.append(str(seq_record.seq))
                names.remove(specie)
                break

ITS = open("very_simple_db_ITS.fa", "w")
for i, s in zip(seq_ids, seqs):
    ITS.write(">" + i + "\n")
    ITS.write(s + "\n")
ITS.close()
   

And here is my failed attempt to parallelize it using multiprocessing:


#!/usr/bin/env python
from Bio import Entrez
from Bio import SeqIO
from multiprocessing import Pool
import subprocess
import itertools
import re

def read_names (file_to_open) :
    name=[]
    names = open (file_to_open , "r")
    for i in names :
        name.append(i.rstrip('\n'))
    name=set(name)
    names.close()
    
    return(name)
    
names = read_names ("head_specie_to_ncbi.txt")
filename = "head_all_its.fa"
lowsize = 300
highsize = 1000

def calc(seq_record , specie_names) :
    dictio = {}

    filename = "all_its.fa"

    longueur = (len(seq_record.seq))
    if (longueur > lowsize and longueur < highsize) :
        ident = (seq_record.description)
        for specie in specie_names :
            if specie in str(ident) :
                myseq = str((seq_record.seq))
                seq_ids.append(str(seq_record.id)+"|"+str(specie))
                dictio[str(ident)] = myseq
                break
    return dictio

if __name__ == '__main__':
    pool = Pool()
    mydict = pool.map(calc(SeqIO.parse(filename, "fasta"), names))
    print str(mydict)

 


Edit: I reduced the computing time to 20 minutes by replacing the regex with Python's "word in string" membership test, but I'm still wondering about the multiprocessing...
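A further speed-up may be possible if the species names are all two-word "Genus species" binomials that appear as whitespace-separated tokens in the FASTA headers (an assumption — the code above matches substrings anywhere in the description). Under that assumption, the inner loop over all 86,859 names can be replaced by one O(1) set lookup per adjacent word pair; `match_species` below is a hypothetical helper, not part of the original code:

```python
def match_species(description, names):
    # Hypothetical helper: assumes every name in `names` is a two-word
    # "Genus species" binomial appearing as adjacent tokens in the header.
    words = description.split()
    for a, b in zip(words, words[1:]):
        bigram = a + " " + b
        if bigram in names:     # O(1) set lookup instead of scanning all names
            return bigram
    return None

names = {"Homo sapiens", "Mus musculus"}
print(match_species("gi|1 Homo sapiens ITS1 sequence", names))  # Homo sapiens
print(match_species("gi|2 Rattus norvegicus ITS1", names))      # None
```

This trades generality for speed: a name split across non-adjacent tokens, or punctuation glued to a word, would no longer match.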

biopython multiprocessing pool

You might mention why it failed and with what exception. From what I see you are passing a function call to the map method, which expects an uncalled function and a sequence over which to map that function. 


Also be aware that 'bee' in 'beetle' will return True, since the string __contains__ method does substring matching; you are probably interested in word boundaries.
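For example, anchoring the name with `\b` word boundaries avoids the substring pitfall (a small sketch; `re.escape` would also be needed for names containing regex metacharacters):

```python
import re

description = "gi|123 Apis beetle ITS region"
print("bee" in description)                        # True  (substring match)
print(bool(re.search(r"\bbee\b", description)))    # False (word-boundary match)
```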
