MCL Clustering for Large BLAST output files
9.5 years ago

Hello,

I have pairwise BLAST results for more than 400 bacterial genomes; each genome has more than 5,000 sequences. I want to cluster the similarity information with MCL to identify protein families. However, when I try to run the MCL clustering program on the output file, MCL runs out of memory even on a machine with 24 GB of RAM. The total size of the BLAST output is more than 300 GB even after parsing out only the best reciprocal hits. Can anyone suggest a better way to perform this operation?

Thank you
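
For context, the usual MCL protocol feeds BLAST output to mcxload as a three-column "ABC" label stream (node, node, weight), which is processed line by line rather than held in memory. Below is a minimal Python sketch of producing such an ABC file in one streaming pass. It assumes NCBI tabular output (-outfmt 6, e-value in column 11); the function name, the 1e-10 cutoff, and the weight cap of 200 are illustrative choices, not anything from the original post:

```python
import math

def blast_to_abc(blast_path, abc_path, evalue_cutoff=1e-10, max_weight=200.0):
    """Stream BLAST -outfmt 6 lines into an ABC file (node<TAB>node<TAB>weight)
    suitable for mcxload, keeping memory usage flat regardless of file size."""
    with open(blast_path) as src, open(abc_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            query, subject, evalue = fields[0], fields[1], float(fields[10])
            if query == subject or evalue > evalue_cutoff:
                continue  # drop self-hits and hits above the cutoff
            # use -log10(E) as the edge weight, capped so E=0 stays finite
            if evalue == 0.0:
                weight = max_weight
            else:
                weight = min(-math.log10(evalue), max_weight)
            dst.write(f"{query}\t{subject}\t{weight:.2f}\n")
```

An ABC file like this can then be loaded with mcxload and clustered with mcl on the resulting native matrix, which is far smaller than the raw BLAST text.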

Apply a more stringent e-value cutoff?

The e-value cutoff is already stringent (1e-10). I'm trying to predict orthologous sequences.

Rerun BLAST with the -max_target_seqs 1 option.

Isn't that what was meant by "parsing out only best reciprocal hits"?

Yes, I have already parsed out the best reciprocal matches, but the pairwise similarity output is still huge because of the large number of genomes. I'm not sure whether MCL can handle such a large matrix, so I'm looking for suggestions on how to perform this operation.

How did you parse the reciprocal matches?

How many lines are in your BLAST output file? Best reciprocal hits from 400 bacterial genomes shouldn't add up to 300 GB: roughly 2 million proteins with at most one reciprocal best hit per other genome is under a billion lines, well short of 300 GB.
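
Those counts can be obtained from a file of this size in a single streaming pass. A sketch, assuming the first two tab-separated columns are the query and subject IDs (memory grows only with the number of distinct IDs, not with the number of lines):

```python
def graph_stats(blast_path):
    """One streaming pass over a tabular BLAST file: count edges (lines)
    and distinct nodes (sequence IDs in the first two columns)."""
    nodes = set()
    edges = 0
    with open(blast_path) as fh:
        for line in fh:
            query, subject = line.split("\t", 3)[:2]
            edges += 1
            nodes.update((query, subject))
    return edges, len(nodes)
```

Knowing the real node and edge counts makes it much easier to judge whether the graph should fit in 24 GB.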

First, it is unclear how many nodes and edges you have; 300 GB of raw BLAST output doesn't tell us anything.

What value of the -I (inflation) parameter are you using? Try setting a higher -I value, such as 4 or 5, and see if MCL manages to finish. Also see the "reducing node degree" section of the mcl documentation.
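
On reducing node degree: the mcl manual describes pruning options and transforms for this, and the same top-k idea can also be applied to the ABC stream before loading. A hedged Python sketch (the function name and the k=50 default are illustrative, not from mcl itself), keeping only the k highest-weight outgoing edges per node with memory proportional to nodes × k rather than to the full edge count:

```python
import heapq
from collections import defaultdict

def cap_node_degree(abc_in, abc_out, k=50):
    """Keep only the k highest-weight edges per source node in an ABC file
    (node<TAB>node<TAB>weight), shrinking the graph before clustering."""
    best = defaultdict(list)  # node -> min-heap of (weight, neighbour)
    with open(abc_in) as fh:
        for line in fh:
            a, b, w = line.split("\t")
            w = float(w)
            heap = best[a]
            if len(heap) < k:
                heapq.heappush(heap, (w, b))
            elif w > heap[0][0]:
                # new edge beats the weakest kept edge: swap it in
                heapq.heapreplace(heap, (w, b))
    with open(abc_out, "w") as out:
        for a, heap in best.items():
            for w, b in sorted(heap, reverse=True):
                out.write(f"{a}\t{b}\t{w}\n")
```

A pruned graph like this is usually what makes MCL feasible on millions of nodes; inflation alone doesn't reduce the size of the input matrix.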
