Question

protein clustering and pangenome tools

22 months ago

Jonathan Yoou ▴ 60

Hi all!

I'm working on subspecies sequences now, and I have questions about selecting a protein clustering/pangenome analysis tool.

I understand that Roary and Panaroo use CD-HIT and BLAST for an initial clustering step that collapses highly similar proteins into one to minimize redundancy in the data, then use MCL to cluster based on the pairwise similarity matrix created by BLAST, and finally output a gene_presence_absence table.

But I don't understand why some tools (e.g. PPanGGoLiN) use only MMSeqs2. Is the difference between MMSeqs2 and CD-HIT whether they are alignment-free or not, so that the major advantage of using MMSeqs2 is reduced computing power and time?

If not, what criteria should I use to choose a clustering/pangenome tool?

clustering pangenome • 1.1k views

updated 22 months ago by Mensur Dlakic ★ 27k • written 22 months ago by Jonathan Yoou ▴ 60

I think you should read this paper: link

22 months ago by andres.firrincieli 3.7k

score 1 · Answer 1 · 2022-09-22

MMSeqs2 has both search and clustering capabilities, so it can replace the other two tools (CD-HIT and BLAST). I have used both CD-HIT and MMSeqs2, and still do. They produce different clustering solutions at low identity thresholds (say, 40% and below), but the results should be very similar at the higher identity thresholds typically used for pangenomes. I don't know which one works better in this specific application, but I think you would be fine with either tool, as both are well known and have been thoroughly tested.
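
If you want to check how similar the two clusterings are on your own data, one option is to run both tools at the same identity threshold and compare the cluster memberships. Below is a minimal Python sketch of that idea, not something taken from either tool. It assumes a CD-HIT `.clstr` file and a two-column MMSeqs2 cluster TSV (representative, then member); the function names, the example IDs, and the choice of the Rand index as the agreement measure are all illustrative assumptions. Note that CD-HIT may shorten long sequence IDs in the `.clstr` file, so the IDs have to match between the two outputs for this to work.

```python
from itertools import combinations
import re


def parse_cdhit_clstr(text):
    """Map each sequence ID to its cluster label, from CD-HIT .clstr text."""
    membership = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            continue
        # Member lines look like: "0<TAB>350aa, >protA... *"
        match = re.search(r">(.+?)\.\.\.", line)
        if match and current is not None:
            membership[match.group(1)] = current
    return membership


def parse_mmseqs_tsv(text):
    """Map each member ID to its representative, from a two-column TSV."""
    membership = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        representative, member = line.split("\t")[:2]
        membership[member] = representative
    return membership


def rand_index(a, b):
    """Fraction of sequence pairs on which two clusterings agree.

    A pair "agrees" if both clusterings put the two sequences together,
    or both keep them apart. Only IDs present in both inputs are used.
    This loops over all pairs, so it is quadratic in the number of IDs.
    """
    shared = sorted(set(a) & set(b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 1.0
    agree = sum((a[x] == a[y]) == (b[x] == b[y]) for x, y in pairs)
    return agree / len(pairs)


# Made-up toy data (hypothetical protein IDs), not real tool output.
CDHIT_EXAMPLE = "\n".join([
    ">Cluster 0",
    "0\t350aa, >protA... *",
    "1\t348aa, >protB... at 98.00%",
    ">Cluster 1",
    "0\t120aa, >protC... *",
    ">Cluster 2",
    "0\t200aa, >protD... *",
    "1\t199aa, >protE... at 96.50%",
])
MMSEQS_EXAMPLE = "\n".join([
    "protA\tprotA",
    "protA\tprotB",
    "protC\tprotC",
    "protD\tprotD",
    "protE\tprotE",
])

if __name__ == "__main__":
    cdhit = parse_cdhit_clstr(CDHIT_EXAMPLE)
    mmseqs = parse_mmseqs_tsv(MMSEQS_EXAMPLE)
    print(f"Rand index: {rand_index(cdhit, mmseqs):.2f}")
```

In the toy data the two clusterings disagree only on whether protD and protE belong together, so 9 of the 10 pairs agree. For a real pangenome with many thousands of proteins the all-pairs loop gets slow; a contingency-table based measure such as scikit-learn's `adjusted_rand_score` would probably be a better fit there.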