I am developing a tool to find homologs of a set of proteins in a metagenome. One of the steps is to take all the sequences of each protein and build a model from them using HMMER. I am an undergraduate student and don't know very much about HMMs (I have read the basics, though).
My primary question is whether I should cluster the highly similar sequences with CD-HIT to remove redundancy before building the models. If so, what would be a sensible sequence identity threshold?
My second question is about sequences of highly variable length. If the representative sequences of a protein differ greatly in length, does that cause any problems when building the model? I believe it shouldn't.
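To make "highly variable in length" concrete, here is a minimal sketch of how the length spread of one protein's sequences could be summarised. It is plain Python and does not involve HMMER or CD-HIT; the function names and the two example sequences are hypothetical and only there for illustration.

```python
from statistics import mean, median


def read_fasta_lengths(text):
    """Return {sequence_id: length} for FASTA-formatted text."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the ID (hypothetical convention).
            parts = line[1:].split()
            name = parts[0] if parts else f"seq{len(lengths) + 1}"
            lengths[name] = 0
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            lengths[name] += len(line)
    return lengths


def length_summary(lengths):
    """Summarise the spread of sequence lengths."""
    values = sorted(lengths.values())
    if not values:
        raise ValueError("no sequences found")
    shortest, longest = values[0], values[-1]
    return {
        "n": len(values),
        "min": shortest,
        "max": longest,
        "mean": mean(values),
        "median": median(values),
        # Ratio of longest to shortest; None if the shortest is empty.
        "max_over_min": longest / shortest if shortest else None,
    }


# Made-up example input, not real data.
example = """>seqA hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>seqB
MKT
"""

if __name__ == "__main__":
    print(length_summary(read_fasta_lengths(example)))
```

A large `max_over_min` ratio is the kind of situation I am asking about.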
I would appreciate any help, guides, or comments.
P.S. This is my first time posting here, so please forgive me if I have not followed some of the guidelines.