Hello,
I have pairwise BLAST results for more than 400 bacterial genomes, each with more than 5000 sequences. I want to cluster the similarity information with MCL to identify protein families. However, when I try to run MCL on the output file, it runs out of memory, even on a machine with 24 GB of RAM. The total size of the BLAST output is more than 300 GB, even after parsing out only the best reciprocal hits. Can anyone suggest a better way to perform this operation?
Thank you
Apply a more stringent (lower) e-value cutoff?
The e-value cutoff is already strict enough at 1e-10. I'm trying to predict orthologous sequences.
Rerun BLAST with the -max_target_seqs 1 option.
Isn't that what "parsing out only best reciprocal hits" means?
Yes, I have already parsed out the best reciprocal matches. But the pairwise similarity output is still huge because of the large number of genomes. I'm not sure whether MCL can handle such a large matrix, so I need some suggestions for performing this operation.
How did you parse the reciprocal matches?
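For comparison, here is a minimal sketch of one way reciprocal best hits could be extracted, in case it helps to check against whatever parser was used. It is not the original poster's method. It assumes BLAST tabular output (-outfmt 6, bitscore in the 12th column) and a hypothetical sequence ID convention of "genome|protein"; the function names are made up for illustration. It also keeps one best hit per query and target genome in memory, so for a data set this size it would need to be run per genome pair rather than on the whole file.

```python
def genome_of(seq_id):
    # Assumption: IDs look like "genome|protein" (hypothetical convention).
    return seq_id.split("|", 1)[0]


def reciprocal_best_hits(lines):
    """Return the set of sorted (a, b) ID pairs that are each other's
    best-scoring hit between their two genomes."""
    best = {}  # (query, target genome) -> (bitscore, subject)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if genome_of(query) == genome_of(subject):
            continue  # ignore within-genome hits
        key = (query, genome_of(subject))
        # Keep only the highest-scoring subject per query and target genome
        # (on ties, the first one seen is kept).
        if key not in best or bitscore > best[key][0]:
            best[key] = (bitscore, subject)

    pairs = set()
    for (query, _), (_, subject) in best.items():
        back = best.get((subject, genome_of(query)))
        # Reciprocal: the subject's best hit in the query's genome is the query.
        if back is not None and back[1] == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs
```

Usage would be something like `reciprocal_best_hits(open("genomeA_vs_genomeB.tsv"))`, where the file name is a placeholder. Each reciprocal pair comes out once, so the result should be far smaller than the raw BLAST output.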
How many lines are there in your BLAST output file? Best reciprocal hits from 400 bacterial genomes shouldn't translate to 300 GB, no way.
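A rough back-of-the-envelope check supports this. The sketch below takes the figures from the question (400 genomes, about 5000 proteins each) and assumes roughly 100 bytes per tabular line, which is a guess, not a measured value. Since a protein can have at most one reciprocal best hit in each other genome, the number of pairs per genome pair is capped by the genome size.

```python
def rbh_size_estimate(genomes, proteins_per_genome, bytes_per_line=100):
    """Generous upper estimate of reciprocal-best-hit output size.

    bytes_per_line is an assumed average line length, not a measured one.
    Returns (max_pairs, max_lines, approx_gigabytes).
    """
    genome_pairs = genomes * (genomes - 1) // 2
    # At most one reciprocal partner per protein per other genome.
    max_pairs = genome_pairs * proteins_per_genome
    # Worst case: every pair is written in both directions.
    max_lines = 2 * max_pairs
    approx_gb = max_lines * bytes_per_line / 1e9
    return max_pairs, max_lines, approx_gb


pairs, lines, gb = rbh_size_estimate(400, 5000)
print(f"{pairs:,} pairs, {lines:,} lines, about {gb:.1f} GB")
```

Under these assumptions the ceiling is about 399 million pairs and roughly 80 GB, and that only holds if every protein had an ortholog in every other genome, which is unrealistic. A 300 GB file therefore suggests that more than the best reciprocal hits ended up in the output.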