Hi all!
I'm working on subspecies sequences now, and I have some questions about choosing a protein clustering/pangenome analysis tool.
I understand that Roary and Panaroo use CD-HIT and BLAST for a first round of clustering that collapses highly similar proteins into one, which reduces redundancy in the data. They then use MCL to cluster the proteins based on the pairwise similarity matrix produced by BLAST, and finally give me a gene_presence_absence table.
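To show how I picture that pipeline, here is a toy Python sketch. It is only my mental model, not how Roary or Panaroo are actually implemented: collapsing exact duplicates stands in for CD-HIT, a naive position-by-position identity stands in for BLAST scores, and simple single-linkage grouping stands in for MCL. All strain names, protein IDs, sequences, and function names are made up for illustration.

```python
from collections import defaultdict

# Made-up toy input: genome name -> {protein id: sequence}
genomes = {
    "strainA": {"A_1": "MKTAYIAKQR", "A_2": "MSLLNDWQEE"},
    "strainB": {"B_1": "MKTAYIAKQR", "B_2": "MSLLNDWQEA"},
    "strainC": {"C_1": "MKTAYIAKQK", "C_2": "MPPGHHTTWV"},
}


def identity(a, b):
    """Fraction of matching positions (a crude stand-in for an alignment score)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_proteins(genomes, threshold=0.9):
    """Return one set of genome names per protein cluster."""
    # Step 1 (stand-in for CD-HIT): collapse identical sequences into one entry.
    owners = {}  # sequence -> list of genomes carrying it
    for genome, proteins in genomes.items():
        for seq in proteins.values():
            owners.setdefault(seq, []).append(genome)
    seqs = list(owners)

    # Step 2 (stand-in for BLAST + MCL): all-vs-all comparison of the
    # representatives, then merge every pair above the threshold (union-find).
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    # Step 3: record which genomes contribute to each cluster.
    clusters = defaultdict(set)
    for i, seq in enumerate(seqs):
        clusters[find(i)].update(owners[seq])
    return list(clusters.values())


def presence_absence(genomes, threshold=0.9):
    """Build a 0/1 table: one row per cluster, one column per genome."""
    names = sorted(genomes)
    rows = [[1 if g in members else 0 for g in names]
            for members in cluster_proteins(genomes, threshold)]
    return names, sorted(rows, reverse=True)


names, rows = presence_absence(genomes)
print(names)
for row in rows:
    print(row)
```

Raising the threshold splits near-identical proteins into separate clusters, so the choice of clustering step directly changes what ends up in the presence/absence table.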
But I don't understand why some tools (e.g. PPanGGoLiN) use only MMseqs2. As I understand it, the difference between MMseqs2 and CD-HIT is whether they are alignment-free or not, so is the major advantage of using MMseqs2 that it reduces computing power and time?
If not, what criteria should I use to choose a clustering/pangenome tool?
I think you should read this paper: link