CD-HIT Clustering evaluation
1
0
Entering edit mode
19 months ago
BDK_compbio ▴ 90

I am using CD-HIT to cluster some protein sequences and I would like to evaluate the performance of the clustering for my dataset. Is there any tool for this provided I have a benchmarked clustering results for those sequences?

Also, Is there any script available to collect the actual sequences from cd-hit result file i.e. actual sequences instead of names in the following results

>Cluster 0
0 2799aa, >PF04998.6|RPOC2_CHLRE/275-3073... *
>Cluster 1
0 2214aa, >PF06317.1|Q6Y625_9VIRU/1-2214... at 80%
1 2215aa, >PF06317.1|O09705_9VIRU/1-2215... at 84%
2 2217aa, >PF06317.1|Q6Y630_9VIRU/1-2217... *
3 2216aa, >PF06317.1|Q6GWS6_9VIRU/1-2216... at 84%
4 527aa, >PF06317.1|Q67E14_9VIRU/6-532... at 63%


UPDATE: for clustering performance evaluation, I am using scikit

CD-HIT Clustering RNA-Seq sequencing • 722 views
1
Entering edit mode

Not a direct answer to your question, but hopefully this will be useful to you. CD-HIT clusters only based on percent identity, which may not be the best property if your goal is to group sequences by evolutionary relatedness. An example: if sequences A-B are 70% identical, B-C are 80% identical and A-C are 75% identical, chances are high that all of them are related to each other. When you cluster at 70% by CD-HIT they will end up in the same cluster, but not if you do it at 80%.

If your goal is to actually cluster sequences by their relatedness, MCL is probably better than CD-HIT. There are scripts in that package that make it easy to do the clustering and to compare clustering solutions against a benchmark.

0
Entering edit mode

Thanks !! I will definitely try the MCL tool.

0
Entering edit mode

Could you please point me to the script that I can use for comparing clustering results with the benchmark? I have my clustering as well as the benchmark in the following tsv format (clusterId \t Name)

0       AT4G16950_0
0       AT4G16950_2
0       AT4G16950_1
1       AT2G20950_1
1       AT2G20950_5
1       AT2G20950_4
1       AT2G20950_9

0
Entering edit mode

To compare MCL clusters you need to run clm in dist mode (clm dist). See here for a summary of various functions contained in the MCL package. Note that input files need to be in MCL's matrix format, not the tsv as you listed above.

The output of clustering evaluation looks like this:

d=0.343 d1=0.105        d2=0.238        nn=14676        c1=1713 c2=1645 n1=scop95_175-superfam-clusters.mcl12   n2=35.mcl12
rand=0.99810 jaccard=0.79318 arand=0.88371


The first line shows the numbers of clusters in reference (c1) and your solution, and d values represent distances - smaller is better. Second line shows Rand, Jaccard and adjusted Rand measures of clustering - closer to 1 is better.

0
Entering edit mode

There are scripts in that package that make it easy to do the clustering and to compare clustering solutions against a benchmark.

Could you be more specific? There is an enormous amount of documentation on MCL and as a newbie to clustering, I'm finding it difficult to pick out what I actually need from this.

Similar to the original poster, want to cluster my sequences by relatedness.

3
Entering edit mode
19 months ago

yeah, one of the problems of bioinformatics are these odd data formats that do not lend themselves to automation. Here it looks like you would need to fashion a little data parser in a programming language. Here is a beast in awk:

cat out.clstr | awk ' /Cluster/ { no+=1;}; !/Cluster/ { id=substr($3, 2, length($3)-4); printf("%s\t%s\n", no, id) } '


will print the cluster number and sequence id that you can then use to extract the sequence:

1   PF04998.6|RPOC2_CHLRE/275-3073
2   PF06317.1|Q6Y625_9VIRU/1-2214
2   PF06317.1|O09705_9VIRU/1-2215
2   PF06317.1|Q6Y630_9VIRU/1-2217
2   PF06317.1|Q6GWS6_9VIRU/1-2216
2   PF06317.1|Q67E14_9VIRU/6-532

0
Entering edit mode

Thanks. This is useful. I just wrote a program to do that task, but using awk is much easier.