I want to use an HMM (trained with the forward-backward algorithm) for protein secondary structure prediction.
Basically, a three-state model is used: States = {H=alpha helix, B=beta sheet, C=coil}
and each state has an emission pmf of size 1-by-20 (one probability for each of the 20 amino acids).
After running the forward-backward algorithm on a "training set" of sequences, expectation maximization converges to a (locally) optimal transition matrix (3-by-3, between the three states) and an emission pmf for each state.
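For reference, the training loop described above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes NumPy, uses per-position scaling to avoid underflow, and the names `encode`, `forward_backward` and `baum_welch_step` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

STATES = ["H", "B", "C"]               # helix, sheet, coil
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues


def encode(seq):
    """Map a string of one-letter amino acid codes to integer indices (hypothetical helper)."""
    return np.array([AMINO_ACIDS.index(ch) for ch in seq])


def forward_backward(obs, pi, A, E):
    """Scaled forward-backward pass for one sequence.

    obs: (T,) symbol indices; pi: (N,) initial probabilities;
    A: (N, N) transition matrix; E: (N, M) emission matrix.
    Returns gamma (T, N), the summed pairwise posteriors (N, N),
    and the log-likelihood of the sequence.
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    scale = np.zeros(T)

    # Forward pass, normalising each step so the numbers stay in range.
    alpha[0] = pi * E[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * E[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass, reusing the forward scaling factors.
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (E[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta  # posterior state probabilities; each row sums to 1
    xi_sum = np.zeros((N, N))
    for t in range(T - 1):
        emit_next = E[:, obs[t + 1]] * beta[t + 1]
        xi_sum += alpha[t][:, None] * A * emit_next[None, :] / scale[t + 1]
    return gamma, xi_sum, np.log(scale).sum()


def baum_welch_step(sequences, pi, A, E):
    """One EM update over all sequences.

    Returns the re-estimated (pi, A, E) and the total log-likelihood
    under the parameters that were passed in.
    """
    N, M = E.shape
    pi_acc = np.zeros(N)
    A_acc = np.zeros((N, N))
    E_acc = np.zeros((N, M))
    total_ll = 0.0
    for obs in sequences:
        gamma, xi_sum, ll = forward_backward(obs, pi, A, E)
        pi_acc += gamma[0]
        A_acc += xi_sum
        for t, symbol in enumerate(obs):
            E_acc[:, symbol] += gamma[t]
        total_ll += ll
    pi_new = pi_acc / pi_acc.sum()
    A_new = A_acc / A_acc.sum(axis=1, keepdims=True)
    E_new = E_acc / E_acc.sum(axis=1, keepdims=True)
    return pi_new, A_new, E_new, total_ll
```

Calling `baum_welch_step` repeatedly until the log-likelihood stops increasing gives the converged transition matrix and emission pmfs. The same quantities (`alpha`, `beta`, `gamma`) are what one would lay out column by column in a spreadsheet.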
Does anyone know of a dataset (preferably very small) of sequences for which the "correct" values of the transition matrix and emission probabilities have been determined? I would like to apply the forward-backward algorithm to that dataset in Excel, check whether I can reproduce the same result, and build my confidence.
And then move on to something less primitive than Excel :o)
Sounds fun. However, aren't there some pretty good models already out there to make structural predictions? I am not trying to discourage you, just curious if there is a new problem you are trying to tackle.
Hi Zev, thanks for the message. I would like to model the simple case first (i.e. 3 states: {H=alpha helix, B=beta sheet, C=coil}) and then allow for more states (i.e. x states: {H1, H2, ..., Hn, B1, B2, ..., Bn, C1, C2, ..., Cn}), similar to what was done in this interesting paper (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1556511&tag=1).
I am a newbie in the field, so maybe one day I will have a new problem to tackle! But for now, I am familiarizing myself with established models.
Anyway, I read the paper very carefully, but would like to play with a dataset for which the "correct" values of the transition matrix and emission probabilities are determined. Do you know of such a set, or how I could obtain one? Thanks!
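One possible way to obtain such a set, sketched here under stated assumptions, is to generate it: pick a transition matrix and emission pmfs, sample sequences from that HMM, and then check whether training recovers values close to the ones chosen. The parameters below (`A_TRUE`, `E_TRUE`, `PI_TRUE`) are made up purely for illustration and are not measured from real proteins; `sample_hmm` is a hypothetical helper, and NumPy is assumed.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative parameters only (assumption): "sticky" states H, B, C.
PI_TRUE = np.array([0.3, 0.2, 0.5])
A_TRUE = np.array([
    [0.90, 0.02, 0.08],   # from H
    [0.03, 0.85, 0.12],   # from B
    [0.10, 0.10, 0.80],   # from C
])
# Arbitrary emission pmfs drawn once with a fixed seed, one row per state.
E_TRUE = np.random.default_rng(42).dirichlet(np.ones(20), size=3)


def sample_hmm(pi, A, E, length, rng):
    """Draw one hidden state path and one observed sequence of the given length."""
    n_states, n_symbols = E.shape
    states = np.empty(length, dtype=int)
    obs = np.empty(length, dtype=int)
    states[0] = rng.choice(n_states, p=pi)
    obs[0] = rng.choice(n_symbols, p=E[states[0]])
    for t in range(1, length):
        states[t] = rng.choice(n_states, p=A[states[t - 1]])
        obs[t] = rng.choice(n_symbols, p=E[states[t]])
    return states, obs


rng = np.random.default_rng(0)
states, obs = sample_hmm(PI_TRUE, A_TRUE, E_TRUE, 50, rng)
print("".join("HBC"[s] for s in states))       # hidden structure labels
print("".join(AMINO_ACIDS[o] for o in obs))    # observed residues
```

Two caveats when comparing against the known values: unsupervised EM can only recover the states up to a relabelling (the learned "state 1" need not be H), and it may stop at a local optimum, so a small sample will not reproduce the chosen parameters exactly.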