Any multiple sequence alignment (MSA) can be converted to a profile HMM (pHMM). I understand that the pHMM models the residue diversity at each alignment position, and that this model can then be used to score matches with tools such as HMMER2, HMMER3, or HHpred.
However, I am curious whether there are established guidelines for the ideal % identity among the sequences in the MSA, so that signal and noise in the pHMM are balanced and both the sensitivity and the specificity of homolog detection are as high as possible.
I could argue that an MSA composed of sequences with < 20% pairwise identity would be hard to justify without solid evidence of structural or functional equivalence despite the poor sequence conservation. So, if I am going to build pHMMs from these MSAs, how much sequence diversity should I allow when constructing the MSA?
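To make the "% identity" I am asking about concrete, here is a minimal sketch of how it could be measured on an existing MSA. The function names and the toy alignment are made up for illustration. Counting identity only over columns where neither sequence has a gap is an assumption on my part; other conventions (e.g. dividing by the shorter sequence length) give different numbers.

```python
from itertools import combinations

GAP_CHARS = {"-", "."}  # assumed gap symbols; adjust to your alignment format


def pairwise_identity(a, b):
    """Percent identity between two aligned sequences of equal length.

    Only columns where both sequences have a residue are compared
    (one of several possible definitions of % identity).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_summary(msa):
    """Return (min, mean, max) pairwise % identity over all sequence pairs."""
    values = [pairwise_identity(a, b) for a, b in combinations(msa.values(), 2)]
    if not values:
        raise ValueError("need at least two sequences")
    return min(values), sum(values) / len(values), max(values)


# Hypothetical toy alignment, not real data.
toy_msa = {
    "seq1": "MKV-LA",
    "seq2": "MKVALA",
    "seq3": "MRI-LG",
}

lowest, mean, highest = identity_summary(toy_msa)
print(f"min={lowest:.1f}%  mean={mean:.1f}%  max={highest:.1f}%")
```

The minimum value is the one my "< 20%" concern refers to: it flags the most divergent pair in the alignment, which the mean alone would hide.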
Links to any published literature on this topic would be much appreciated. Thanks folks!