I am using CD-HIT to cluster some protein sequences, and I would like to evaluate how well the clustering performs on my dataset. Is there a tool for this, given that I have benchmark clustering results for those sequences?
Also, is there a script available to collect the actual sequences from a CD-HIT result file, i.e. the sequences themselves rather than just the names listed in results like the following?
>Cluster 0
0 2799aa, >PF04998.6|RPOC2_CHLRE/275-3073... *
>Cluster 1
0 2214aa, >PF06317.1|Q6Y625_9VIRU/1-2214... at 80%
1 2215aa, >PF06317.1|O09705_9VIRU/1-2215... at 84%
2 2217aa, >PF06317.1|Q6Y630_9VIRU/1-2217... *
3 2216aa, >PF06317.1|Q6GWS6_9VIRU/1-2216... at 84%
4 527aa, >PF06317.1|Q67E14_9VIRU/6-532... at 63%
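To make the second question concrete, here is a minimal sketch of the kind of script I have in mind. It is not an official CD-HIT utility. It assumes the result file has the usual layout of one `>Cluster N` line followed by one member per line, that each name sits between `>` and `...`, and that those names are the full first token of the headers in the original FASTA file (if the names were truncated, matching would have to be done by prefix instead). The function names `parse_clstr`, `read_fasta` and `cluster_sequences` are my own, as are the file names in the usage comment.

```python
import re

# Captures the name in a member line such as:
#   "0 2799aa, >PF04998.6|RPOC2_CHLRE/275-3073... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster_name: [sequence_id, ...]} from the lines of a result file."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            clusters[current] = []
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def read_fasta(lines):
    """Return {sequence_id: sequence}, keyed by the first token of each header."""
    chunks = {}
    seq_id = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            chunks[seq_id] = []
        elif seq_id is not None and line:
            chunks[seq_id].append(line)
    return {sid: "".join(parts) for sid, parts in chunks.items()}


def cluster_sequences(clusters, seqs):
    """Return {cluster_name: [(sequence_id, sequence), ...]}, skipping unknown ids."""
    return {
        name: [(sid, seqs[sid]) for sid in ids if sid in seqs]
        for name, ids in clusters.items()
    }


# Usage with hypothetical file names:
#   with open("proteins.fasta") as fa, open("clustered.clstr") as cl:
#       result = cluster_sequences(parse_clstr(cl), read_fasta(fa))
```

The same `parse_clstr` output could also be flattened into one cluster label per sequence, which is the form I would need in order to compare against the benchmark clustering.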
UPDATE: For the clustering performance evaluation, I am using scikit-learn.