I have a metagenomic dataset with several hundred thousand relatively short genes, and I want to find the protein families present in it (based on their amino acid sequences).
I've tried cdhit with a 40% identity threshold, as well as mclblastline with inflation values ranging from 3.0 to 10.0. Compared to mcl, cdhit creates smaller families (fewer members each) that are more homogeneous, judging by the domains and KEGG orthology assignments of their members.
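One way to put a number on this kind of homogeneity is the fraction of annotated members in each cluster that share the cluster's most common KEGG orthology (KO) label. The following is only a minimal sketch: the function names, the `clusters` and `annotation` dictionaries, and the toy gene and KO identifiers are all hypothetical, and parsing the actual cdhit or mcl output into these dictionaries is left out.

```python
from collections import Counter


def cluster_purity(clusters, annotation):
    """Return {cluster_id: purity} for clusters with at least one annotated member.

    clusters:   dict mapping a cluster id to a list of gene ids (hypothetical layout)
    annotation: dict mapping a gene id to a single label, e.g. a KO id
    Purity is the share of annotated members carrying the most common label.
    """
    purity = {}
    for cluster_id, members in clusters.items():
        labels = [annotation[g] for g in members if g in annotation]
        if not labels:
            # No annotated members: purity is undefined, so skip the cluster.
            continue
        top_count = Counter(labels).most_common(1)[0][1]
        purity[cluster_id] = top_count / len(labels)
    return purity


def mean_purity(clusters, annotation):
    """Unweighted mean purity over all clusters that could be scored."""
    scores = cluster_purity(clusters, annotation)
    return sum(scores.values()) / len(scores) if scores else float("nan")


# Toy data, made up for illustration only.
clusters = {"c1": ["g1", "g2", "g3"], "c2": ["g4", "g5"], "c3": ["g6"]}
annotation = {
    "g1": "K00001", "g2": "K00001", "g3": "K00002",
    "g4": "K00003", "g5": "K00003",
}

scores = cluster_purity(clusters, annotation)
overall = mean_purity(clusters, annotation)
```

Very small clusters are trivially pure under this measure, so a comparison between two clusterings probably also needs to take cluster sizes into account.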
I think I prefer cdhit's clusters; the only thing I don't like is that I can't decrease the threshold below 40% (which I think may be necessary for such a task). On the other hand, mcl is meant for exactly this kind of task, right?
Has anyone done such an analysis and can give me some advice? I'm particularly interested in people's opinions about mcl.