Question: Understanding Profile HMM Emission Probabilities
Asked 7.1 years ago by Damian Kao (USA), 0 votes:

I am trying to understand the concept behind profile HMMs. I have a question about emission probabilities.

Every residue position in a profile HMM has a match, an insertion, and a deletion state. The match state has emission probabilities based on the sequences it was trained on.

What do the emission probabilities of the insertion and deletion states look like? How are they trained? Would the probability just be evenly split across all the symbols?

Written 7.1 years ago by Damian Kao; modified 5.4 years ago by Biostar.
Answer by Botond Sipos (United Kingdom), 7.1 years ago, 3 votes:

When training a profile HMM from a multiple sequence alignment, we first have to define the so-called "consensus columns". There are many ways to do this, for example:

  • Specify them manually by annotating the input alignment.
  • Define them as the columns whose proportion of gaps is below a given threshold.
  • Use the "MAP model construction algorithm", which chooses the consensus columns that maximise the posterior probability of the model, combining the likelihood of the data with a prior on the number of consensus columns.

HMMER 3.0 seems to implement the first two approaches.
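
The gap-threshold rule from the list above can be sketched in a few lines of Python. This is only an illustration, not HMMER's code: the function name `consensus_columns`, the parameter `max_gap_fraction`, its default of 0.5, and the toy DNA alignment are all assumptions made for the example.

```python
def consensus_columns(alignment, max_gap_fraction=0.5):
    """Return indices of columns whose gap fraction is below the threshold.

    `alignment` is a list of equal-length strings with '-' marking gaps.
    The name and the 0.5 default are hypothetical choices for this sketch.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    consensus = []
    for col in range(n_cols):
        gaps = sum(1 for seq in alignment if seq[col] == "-")
        if gaps / n_seqs < max_gap_fraction:
            consensus.append(col)
    return consensus


# Toy alignment (made up): column 2 is mostly gaps, so it is non-consensus.
toy_alignment = [
    "AC-GT",
    "AC-GT",
    "ACAGT",
    "A--GT",
]
print(consensus_columns(toy_alignment))
```

Columns that are not returned would be treated as insert columns in the rest of the training procedure.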

With the consensus columns marked, each one is assigned a match state and a delete state. The residues in a consensus column are used to calculate the emission probabilities of its match state by maximum likelihood (ML), usually with pseudocounts added, for example derived from substitution matrices. The "gappiness" of the column is used to estimate, again by ML, the transition probability into the associated delete state, which is a silent state and therefore has no emission probabilities.
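
As a rough illustration of these two estimates, here is a minimal Python sketch. It makes several simplifying assumptions that are not from the text above: a uniform add-one pseudocount stands in for substitution-matrix pseudocounts, the two consensus columns are adjacent with no insert columns between them, and only match-to-match and match-to-delete transitions are counted. The function names and the toy columns are hypothetical.

```python
from collections import Counter


def match_emissions(column, alphabet="ACGT", pseudocount=1.0):
    """Emission probabilities for one match state from one consensus column.

    Uses a uniform pseudocount (an assumption of this sketch); gaps are ignored.
    """
    counts = Counter(c for c in column if c != "-")
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


def match_to_delete_probability(prev_column, column, pseudocount=1.0):
    """Estimate P(match at k-1 -> delete at k) from two adjacent consensus columns.

    Only sequences with a residue in the previous column are in its match
    state; a gap in the current column means they moved to the delete state.
    Transitions into an insert state are ignored in this simplified sketch.
    """
    from_match = [(p, c) for p, c in zip(prev_column, column) if p != "-"]
    to_delete = sum(1 for _, c in from_match if c == "-")
    return (to_delete + pseudocount) / (len(from_match) + 2 * pseudocount)


# Toy consensus columns (made up), read top to bottom across four sequences.
prev_col = "AAAA"
this_col = "CCC-"
emissions = match_emissions(this_col)
p_delete = match_to_delete_probability(prev_col, this_col)
print(emissions, p_delete)
```

Note that the delete state gets a transition probability but no emission table, which is what "silent" means here.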

The non-consensus columns are assigned to insert states. The residues from these columns (plus pseudocounts) are used for the ML estimation of the emission probabilities of the insert states.
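
A minimal sketch of that estimate, assuming the residues of all non-consensus columns in one stretch are pooled and a uniform add-one pseudocount is used. The function name `insert_emissions`, the DNA alphabet, and the toy alignment are hypothetical choices for the example.

```python
from collections import Counter


def insert_emissions(alignment, insert_cols, alphabet="ACGT", pseudocount=1.0):
    """Emission probabilities for one insert state.

    Pools residues from the given non-consensus columns and adds a uniform
    pseudocount (an assumption of this sketch) before normalising.
    """
    counts = Counter(
        seq[col] for seq in alignment for col in insert_cols if seq[col] != "-"
    )
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


# Toy alignment (made up): column 2 is the only non-consensus column.
toy_alignment = [
    "AC-GT",
    "AC-GT",
    "ACAGT",
    "A--GT",
]
ins = insert_emissions(toy_alignment, insert_cols=[2])
print(ins)
```

With only a single observed residue in the insert column, the pseudocounts contribute most of the probability mass in this toy case.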

This is just a simplified description of profile HMM training, but I hope it helps. For more details, see:

Durbin, Eddy, Krogh, Mitchison (1998): Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids


Thanks. I think I got it now. So the insertion state has the probability of the "consensus insertions". I guess the deletion state would have to be silent, since there aren't any symbols to represent it in the observed sequence.

Reply written 7.1 years ago by Damian Kao

Small clarification: insertion states are defined by a "stretch" of consecutive non-consensus columns (possibly of zero length). The length of the stretch is used to estimate the self-transition probability (modelling insertion length). The emission probabilities for the insert state are calculated from the residues in the stretch but, if I am not mistaken, they are usually dominated by the pseudocounts in the case of HMMER profiles.
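
To make the self-transition estimate concrete, here is a small Python sketch under stated assumptions: a sequence with n residues in the stretch contributes n - 1 insert-to-insert transitions and one transition out of the insert state, sequences with no residues in the stretch never enter the state, and a uniform add-one pseudocount is applied. The function name and the toy alignment are hypothetical.

```python
def insert_self_transition(alignment, insert_cols, pseudocount=1.0):
    """Estimate P(insert -> insert) for one stretch of non-consensus columns.

    Each sequence that enters the insert state stays n - 1 times and leaves
    once, where n is its number of residues in the stretch. The uniform
    pseudocount is an assumption of this sketch.
    """
    stay = 0
    leave = 0
    for seq in alignment:
        n = sum(1 for col in insert_cols if seq[col] != "-")
        if n > 0:
            stay += n - 1
            leave += 1
    return (stay + pseudocount) / (stay + leave + 2 * pseudocount)


# Toy alignment (made up): columns 1 and 2 form one insert stretch.
toy_alignment = [
    "A--C",
    "AGGC",
    "AG-C",
    "AGTC",
]
p_stay = insert_self_transition(toy_alignment, insert_cols=[1, 2])
print(p_stay)
```

Longer insertions in the training alignment push this probability up, which makes the model expect longer insertions at that position.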

Reply written 7.1 years ago by Botond Sipos
Answer by Gjain (Göttingen, Germany), 7.1 years ago, 1 vote:

Hi DK,

This article from Nature should help you understand everything you asked:

What is a hidden Markov model?

For a general understanding of HMMs

I hope this helps.


Thanks, Gjain. I've actually read through those two links already. I have a decent grasp of how HMMs work in general. My question is specifically about profile HMMs and how emission probabilities are trained for insertions and deletions.

Reply written 7.1 years ago by Damian Kao