Question

MCL Clustering for Large BLAST output files

9.5 years ago

thakurshalabh ▴ 70

Hello,

I have pairwise BLAST results for more than 400 bacterial genomes, each with more than 5,000 sequences. I want to cluster the similarity information with MCL to identify protein families. However, when I run MCL on the output file, it runs out of memory, even on a machine with 24 GB of RAM. The total size of the BLAST output is more than 300 GB after parsing out only the best reciprocal hits. Can anyone suggest a better way to perform this operation?

Thank you

alignment blast • 4.6k views

updated 2.2 years ago by Ram 43k • written 9.5 years ago by thakurshalabh ▴ 70

Apply a stricter (lower) e-value cutoff?

9.5 years ago by 5heikki 11k

The e-value cutoff is already stringent enough (1e-10). I'm trying to predict orthologous sequences.

updated 2.2 years ago by Ram 43k • written 9.5 years ago by thakurshalabh ▴ 70

Rerun BLAST with the -max_target_seqs 1 option.

9.5 years ago by Renesh ★ 2.2k

Isn't that what is meant by "parsing out only best reciprocal hits"?

9.5 years ago by mikhail.shugay 3.5k

Yes, I have already parsed out the best reciprocal matches, but the pairwise similarity output is still huge because of the large number of genomes. I'm not sure whether MCL can handle such a large matrix, so I need some suggestions on how to perform this operation.
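For readers wondering what "parsing best reciprocal matches" can look like, here is a minimal sketch. It is not the poster's actual script. It assumes 12-column tabular BLAST output (-outfmt 6) with the bit score in the last column, and a hypothetical sequence ID convention of genome|protein; the function names are made up for illustration.

```python
import csv
import io


def genome_of(seq_id):
    # Assumption: IDs look like "genome|protein"; adapt to your own naming scheme.
    return seq_id.split("|", 1)[0]


def best_hits(rows):
    """Best hit of each query in each other genome, ranked by bit score (column 12)."""
    best = {}
    for row in rows:
        query, subject, bitscore = row[0], row[1], float(row[11])
        if genome_of(query) == genome_of(subject):
            continue  # skip within-genome hits, including self-hits
        key = (query, genome_of(subject))
        if key not in best or bitscore > best[key][1]:
            best[key] = (subject, bitscore)
    return best


def reciprocal_best_hits(rows):
    """Keep a pair only if each protein is the other's best hit in its genome."""
    best = best_hits(rows)
    pairs = set()
    for (query, _), (subject, _) in best.items():
        back = best.get((subject, genome_of(query)))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, subject))))
    return sorted(pairs)


# Tiny made-up example: two genomes, A and B.
SAMPLE = """\
A|p1\tB|p1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200
A|p1\tB|p2\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-12\t80
B|p1\tA|p1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t198
B|p2\tA|p1\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-12\t80
"""

rows = list(csv.reader(io.StringIO(SAMPLE), delimiter="\t"))
print(reciprocal_best_hits(rows))
```

This version keeps every best hit in memory, so at the scale discussed in this thread it would probably have to be run one genome pair at a time rather than on the whole file.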

9.5 years ago by thakurshalabh ▴ 70

How did you parse the reciprocal matches?

5.4 years ago by amra.dhabalia • 0

How many lines are there in your BLAST output file? Best reciprocal hits from 400 bacterial genomes shouldn't add up to 300 GB, no way.
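A rough back-of-the-envelope check of that intuition, as a sketch only. It uses the round numbers from the question (400 genomes, 5,000 proteins each) and assumes about 100 bytes per tabular line, which is a guess. Within one genome pair, each protein can have at most one reciprocal best partner, so the count per genome pair is bounded by the number of proteins per genome.

```python
from math import comb


def rbh_upper_bound(n_genomes, proteins_per_genome, bytes_per_line=100):
    """Upper bound on reciprocal best-hit pairs and on the size of a file listing them."""
    genome_pairs = comb(n_genomes, 2)
    # At most one reciprocal partner per protein per genome pair.
    max_pairs = genome_pairs * proteins_per_genome
    return genome_pairs, max_pairs, max_pairs * bytes_per_line


genome_pairs, max_pairs, max_bytes = rbh_upper_bound(400, 5000)
print(f"{genome_pairs:,} genome pairs")
print(f"at most {max_pairs:,} reciprocal best-hit pairs")
print(f"roughly {max_bytes / 1e9:.1f} GB at 100 bytes per line")
```

Under these assumptions the bound is in the tens of GB, and about double that if each pair is written in both directions. A 300 GB file therefore suggests that more than the reciprocal best hits is being stored, although the real numbers in the question are "more than" 400 and 5,000.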

9.5 years ago by 5heikki 11k

Ram · Answer 1 · 2014-10-13

9.5 years ago

mikhail.shugay 3.5k

First, it is unclear how many nodes/edges you have; 300 GB of raw BLAST output doesn't tell us anything.
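One way to get those numbers is to stream the hit file once and count, sketched below. It assumes a tab-separated file with the two sequence IDs in the first two columns; the function name and the sample data are made up for illustration.

```python
import csv
import io


def graph_size(lines):
    """Return (number of distinct nodes, number of non-self edge lines)."""
    nodes = set()
    edge_lines = 0
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        a, b = row[0], row[1]
        nodes.add(a)
        nodes.add(b)
        if a != b:
            edge_lines += 1
    return len(nodes), edge_lines


# Made-up sample: a-b listed in both directions, one a-c hit, one self-hit.
SAMPLE = "a\tb\t200\nb\ta\t198\na\tc\t80\na\ta\t500\n"
n_nodes, n_edge_lines = graph_size(io.StringIO(SAMPLE))
print(n_nodes, n_edge_lines)
```

Only the set of node IDs is held in memory, so this should stay manageable even for a very large file. If every pair is listed in both directions, the number of undirected edges is roughly half the edge-line count. For a real file, pass an open file handle instead of the StringIO object.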

What value of the -I parameter are you using? Try setting a higher -I value, like 4 or 5, and see if MCL manages to finish. Also see the "reducing node degree" section from here.
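To illustrate the general idea of reducing node degree before clustering, here is a small sketch that keeps an edge only if it is among the k heaviest edges of both endpoints, then writes label/label/weight lines that MCL can read with --abc. This is an illustration of the idea, not necessarily what the linked section prescribes. The function names, the mutual top-k rule and the sample weights are assumptions, and weights are taken to be similarity scores where higher means more similar.

```python
import heapq
from collections import defaultdict


def cap_degree(edges, k):
    """Keep an undirected edge only if it is in the top-k (by weight) of both endpoints."""
    neighbours = defaultdict(dict)
    for a, b, w in edges:
        if a == b:
            continue  # ignore self-loops
        # Symmetrise: if a pair appears more than once, keep the larger weight.
        w = max(w, neighbours[a].get(b, w))
        neighbours[a][b] = w
        neighbours[b][a] = w
    top = {
        node: set(heapq.nlargest(k, nbrs, key=nbrs.get))
        for node, nbrs in neighbours.items()
    }
    kept = []
    for a, nbrs in neighbours.items():
        for b, w in nbrs.items():
            if a < b and b in top[a] and a in top[b]:
                kept.append((a, b, w))
    return sorted(kept)


def to_abc(edges):
    """Format edges as tab-separated label, label, weight lines."""
    return [f"{a}\t{b}\t{w}" for a, b, w in edges]


# Made-up example: "h" is a hub with four neighbours; cap the degree at 2.
EDGES = [("h", "a", 10), ("h", "b", 8), ("h", "c", 5), ("h", "d", 1), ("a", "b", 3)]
capped = cap_degree(EDGES, k=2)
print("\n".join(to_abc(capped)))
# The resulting lines could then be saved and clustered with, for example:
#   mcl capped.abc --abc -I 4 -o clusters.txt   (file names are placeholders)
```

With the mutual rule no node ends up with more than k neighbours, at the cost of dropping weakly connected nodes such as c and d from the output entirely. This pure-Python version holds the whole graph in memory, so it only shows the logic; a graph of the size discussed here would need a streaming or tool-based approach.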

updated 2.2 years ago by Ram 43k • written 9.5 years ago by mikhail.shugay 3.5k