Question: MCL Clustering for Large BLAST output files
4.6 years ago by
Canada
thakurshalabh (60) wrote:

Hello,

I have pairwise BLAST results for more than 400 bacterial genomes, each with more than 5,000 sequences. I want to cluster the similarity information with MCL to identify protein families. However, when I run MCL on the output file, it runs out of memory, even on a machine with 24 GB of RAM. The total size of the BLAST output is more than 300 GB after parsing out only the best reciprocal hits. Can anyone suggest a better way to perform this operation?

 

Thank you

blast alignment • 2.5k views
modified 2.9 years ago by Biostar ♦♦ (20) • written 4.6 years ago by thakurshalabh (60)

Apply a stricter e-value cutoff?

Reply written 4.6 years ago by 5heikki (8.4k)

The e-value cutoff is already stringent enough (1e-10). I'm trying to predict orthologous sequences.

Reply written 4.6 years ago by thakurshalabh (60)

Rerun BLAST with the -max_target_seqs 1 option.

Reply written 4.6 years ago by Renesh (1.6k)

Isn't that what is meant by "parsing out only best reciprocal hits"?

Reply written 4.6 years ago by mikhail.shugay (3.3k)

Yes, I have already parsed out the best reciprocal matches, but the pairwise similarity output is still huge because of the large number of genomes. I'm not sure whether MCL can handle such a large matrix, so I need some suggestions for performing this operation.
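For readers wondering what "parsing out best reciprocal matches" can look like, here is a minimal sketch. It is not the poster's actual script. It assumes BLAST tabular output (-outfmt 6, bitscore in column 12) and a hypothetical sequence ID convention of "genome|protein"; the function names are made up for illustration. It keeps everything in memory, so it only suits small inputs, not the full data set discussed here.

```python
def genome_of(seq_id):
    # Assumption: IDs look like "genome|protein" (hypothetical convention).
    return seq_id.split("|", 1)[0]


def parse_tabular(lines):
    """Yield the tab-separated fields of each non-empty, non-comment line."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield line.rstrip("\n").split("\t")


def best_hits(rows):
    """Map (query, target genome) -> (subject, bitscore) with the top bitscore."""
    best = {}
    for row in rows:
        query, subject, bitscore = row[0], row[1], float(row[11])
        if genome_of(query) == genome_of(subject):
            continue  # ignore hits within the same genome
        key = (query, genome_of(subject))
        if key not in best or bitscore > best[key][1]:
            best[key] = (subject, bitscore)
    return best


def reciprocal_best_hits(rows):
    """Return the set of unordered pairs that are each other's best hit."""
    best = best_hits(rows)
    pairs = set()
    for (query, _), (subject, _) in best.items():
        back = best.get((subject, genome_of(query)))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


demo = [
    "A|p1\tB|p1\t90\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "A|p1\tB|p2\t70\t100\t30\t0\t1\t100\t1\t100\t1e-30\t150",
    "B|p1\tA|p1\t90\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "B|p2\tA|p1\t70\t100\t30\t0\t1\t100\t1\t100\t1e-30\t150",
]
rbh = reciprocal_best_hits(parse_tabular(demo))
print(rbh)
```

In the demo, B|p2 hits A|p1 best, but A|p1 prefers B|p1, so only the A|p1 / B|p1 pair survives.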

Reply written 4.6 years ago by thakurshalabh (60)

How did you parse the reciprocal matches?

Reply written 6 months ago by amra.dhabalia (0)

How many lines are there in your BLAST output file? Best reciprocal hits from 400 bacterial genomes shouldn't come to 300 GB, no way.
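A rough back-of-the-envelope check of that claim, as a sketch only. It assumes exactly 400 genomes with 5,000 proteins each (the question says "more than" both), at most one reciprocal best hit per protein in every other genome, and a guessed 100 bytes per tabular line; the function name is hypothetical.

```python
def estimate_rbh_size(n_genomes, seqs_per_genome, bytes_per_line=100):
    """Upper bound on reciprocal-best-hit pairs and on the file size in bytes.

    Assumes each protein has at most one reciprocal best hit in every
    other genome and that each unordered pair is stored once.
    """
    n_proteins = n_genomes * seqs_per_genome
    max_pairs = n_proteins * (n_genomes - 1) // 2
    return max_pairs, max_pairs * bytes_per_line


pairs, size_bytes = estimate_rbh_size(400, 5000)
print(f"{pairs:,} pairs, about {size_bytes / 1e9:.1f} GB")
# prints: 399,000,000 pairs, about 39.9 GB
```

Even if both directions of every pair were written out, the bound would roughly double to about 80 GB under these assumptions, still well short of 300 GB.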

Reply written 4.6 years ago by 5heikki (8.4k)
4.6 years ago by
Czech Republic, Brno, CEITEC
mikhail.shugay (3.3k) wrote:

First, it is unclear how many nodes and edges you have; 300 GB of raw BLAST output doesn't tell us anything about that.

What value of the -I (inflation) parameter are you using? Try setting a higher -I value, such as 4 or 5, and see whether MCL manages to finish. Also see the "reducing node degree" section here: http://micans.org/mcl/man/clmprotocols.html.
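The points above could be checked with something like the following sketch. It assumes the graph is in MCL's label format ("labelA labelB weight" per line, read with --abc); the file names and function names are hypothetical. The top-k pruning is a plain Python analogue of degree reduction, not the mechanism described in the MCL manual, and it holds all edges in memory.

```python
import heapq
from collections import defaultdict


def graph_size(abc_lines):
    """Count distinct nodes and edge lines in 'label label weight' input."""
    nodes, edges = set(), 0
    for line in abc_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        nodes.update(fields[:2])
        edges += 1
    return len(nodes), edges


def keep_top_k(edges, k):
    """Keep an edge if it is among the k heaviest edges of either endpoint."""
    by_node = defaultdict(list)
    for a, b, w in edges:
        by_node[a].append((w, a, b))
        by_node[b].append((w, a, b))
    kept = set()
    for node_edges in by_node.values():
        kept.update(heapq.nlargest(k, node_edges))
    return sorted((a, b, w) for w, a, b in kept)


def mcl_command(abc_path, inflation=4.0, out_path="out.mcl"):
    """Build (but do not run) an MCL command line with the given -I value."""
    return ["mcl", abc_path, "--abc", "-I", str(inflation), "-o", out_path]


print(graph_size(["a b 5.0", "a c 1.0", "b c 4.0"]))
print(keep_top_k([("a", "b", 5.0), ("a", "c", 1.0), ("b", "c", 4.0)], k=1))
print(" ".join(mcl_command("graph.abc", inflation=4)))
```

Reporting the node and edge counts answers the first question directly, and the command list can be passed to subprocess.run once MCL is installed.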

written 4.6 years ago by mikhail.shugay (3.3k)