When training a profile HMM using a multiple sequence alignment as an input, first we have to define the so-called "consensus columns". There are many ways to define the consensus columns, for example:
- Specify them manually by annotating the input alignment.
- Define them as columns with a smaller proportion of gaps than a given threshold.
- Use the "MAP (maximum a posteriori) model construction algorithm", which picks the set of consensus columns that maximises the posterior probability of the model, i.e. the likelihood of the data combined with a prior on the number of consensus columns.
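HMMER 3.0 seems to implement the first two approaches (as far as I can tell, these correspond to the `--hand` and `--fast` options of `hmmbuild`, respectively).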
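To make the gap-threshold rule concrete, here is a minimal sketch in Python. It is not HMMER's code: the function name `consensus_columns`, the `max_gap_fraction` parameter, and the toy DNA alignment are all made up for illustration, and the alignment is assumed to be a list of equal-length strings with `-` or `.` as gap characters.

```python
def consensus_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Return the 0-based indices of columns whose gap fraction is
    strictly below max_gap_fraction (hypothetical helper)."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    consensus = []
    for col in range(n_cols):
        gaps = sum(1 for seq in alignment if seq[col] in gap_chars)
        if gaps / n_seqs < max_gap_fraction:
            consensus.append(col)
    return consensus


# Toy alignment: the third column is half gaps, so it is not a consensus column.
alignment = [
    "AC-GT",
    "AC-GT",
    "A-TGT",
    "ACAG-",
]
print(consensus_columns(alignment))  # [0, 1, 3, 4]
```

Whether a column sitting exactly at the threshold counts as consensus is a design choice; this sketch excludes it.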
Now that we have the consensus columns marked, each of them is assigned a match state and a delete state. The residues in a consensus column are used to calculate the emission probabilities of its match state by Maximum Likelihood (usually with some pseudocounts added, e.g. derived from substitution matrices). The "gappiness" of a consensus column is used to estimate (again by ML) the transition probability into the associated delete state, which is a silent state and so has no emission probabilities.
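The two estimates for a consensus column can be sketched roughly as below. This is only an illustration under simplifying assumptions: a toy DNA alphabet, a flat pseudocount of 1 per symbol (not pseudocounts derived from a substitution matrix), and hypothetical helper names (`match_emissions`, `state_path`, `transitions_from_match`). Consensus positions are numbered from 1 in the state path.

```python
from collections import Counter

ALPHABET = "ACGT"  # toy DNA alphabet; proteins would use the 20 amino acids
GAPS = "-."


def match_emissions(alignment, col, pseudocount=1.0):
    """Emission estimate for the match state of one consensus column:
    observed residue counts plus a flat pseudocount, normalised."""
    counts = Counter(seq[col] for seq in alignment if seq[col] in ALPHABET)
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}


def state_path(seq, consensus):
    """Label one aligned sequence as a list of (state, k) pairs.
    Consensus columns give M (residue) or D (gap); residues in other
    columns give I; gaps in non-consensus columns give no state."""
    cons = set(consensus)
    path, k = [], 0
    for col, ch in enumerate(seq):
        if col in cons:
            k += 1
            path.append(("D" if ch in GAPS else "M", k))
        elif ch not in GAPS:
            path.append(("I", k))
    return path


def transitions_from_match(alignment, consensus, k, pseudocount=1.0):
    """Estimate P(M_k -> M), P(M_k -> D) and P(M_k -> I) from counts."""
    counts = Counter()
    for seq in alignment:
        path = state_path(seq, consensus)
        for (state, pos), (nxt, _) in zip(path, path[1:]):
            if state == "M" and pos == k:
                counts[nxt] += 1
    total = sum(counts.values()) + 3 * pseudocount
    return {s: (counts[s] + pseudocount) / total for s in "MDI"}


alignment = ["AC-GT", "AC-GT", "A-TGT", "ACAG-"]
consensus = [0, 1, 3, 4]  # 0-based consensus column indices

emissions = match_emissions(alignment, col=1)
transitions = transitions_from_match(alignment, consensus, k=1)
print(emissions)
print(transitions)
```

In the toy alignment, one of the four sequences has a gap in the second consensus column, so the M1 -> D2 count is 1 out of 4 before the pseudocounts are added.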
The non-consensus columns are assigned to insert states. The residues from these columns (plus pseudocounts) are used for the ML estimation of the emission probabilities of the insert states.
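A similarly rough sketch for the insert states: residues from the non-consensus columns that follow the k-th consensus column are pooled into insert state k (with k = 0 for anything before the first consensus column). The function name `insert_emissions`, the flat pseudocount, and the toy DNA alignment are again assumptions for illustration only.

```python
from collections import Counter


def insert_emissions(alignment, consensus, alphabet="ACGT", pseudocount=1.0):
    """Map insert state index k to an emission distribution estimated from
    the non-consensus columns lying after the k-th consensus column."""
    consensus = sorted(consensus)
    n_cols = len(alignment[0])
    counts = {k: Counter() for k in range(len(consensus) + 1)}
    k = 0
    for col in range(n_cols):
        if k < len(consensus) and col == consensus[k]:
            k += 1  # passed another consensus column
            continue
        for seq in alignment:
            if seq[col] in alphabet:  # skip gap characters
                counts[k][seq[col]] += 1
    probs = {}
    for k, c in counts.items():
        total = sum(c.values()) + pseudocount * len(alphabet)
        probs[k] = {a: (c[a] + pseudocount) / total for a in alphabet}
    return probs


alignment = ["AC-GT", "AC-GT", "A-TGT", "ACAG-"]
consensus = [0, 1, 3, 4]  # 0-based consensus column indices

insert_probs = insert_emissions(alignment, consensus)
print(insert_probs[2])  # insert state after the second consensus column
```

Insert states with no observed residues fall back to the uniform distribution implied by the pseudocounts.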
This is just a simplified description of profile HMM training, but I hope it helps. For more details, see:
Durbin, Eddy, Krogh, Mitchison (1998): Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids