Question: Hidden Markov Model On Copy Number Data
7.1 years ago by
Vikas Bansal2.3k
Berlin, Germany
Vikas Bansal2.3k wrote:

Hi everyone,

I am working on copy number analysis and want to apply HMM on my data.

Say I have data for 1 individual with ~60k windows. For each window I know whether the copy number is a gain, a loss or normal. E.g.:

chr1       0           100         Loss
chr1       500         600         Loss
chr1       600         700         Gain

What I want to do-

I want to find out whether the observed state of any window is due to an error, so I want to infer the true state of each window from the states of the previous windows. E.g., say I have 10 windows with the following copy numbers:

Loss
Loss
Loss
Loss
Normal
Loss
Loss
Loss
Loss
Loss

In the above example, we can say that the copy number in the 5th window (Normal) is probably due to an error, so we can set the true state of the 5th window to Loss. (This is only one simple example; there will be many other cases where we cannot decide just by looking.)

What I have understood-

  1. I can define my 3 states as - Gain, Loss and Normal.

  2. Then I can randomly assign state transition probability and observation probability.

  3. Then apply the Baum-Welch algorithm to fit the parameters (to adjust my random probabilities based on the sequence of observed states in my 60k windows).

  4. Then apply the Viterbi algorithm to get the true states.
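
For illustration, step 4 can be sketched on the 10-window example above in plain Python. This is only a minimal sketch: the values of `STAY` and `CORRECT` are assumed, hand-picked numbers rather than fitted parameters, and the helper names `trans`, `emit` and `viterbi` are made up for this example.

```python
import math

STATES = ["Gain", "Loss", "Normal"]

# Assumed, hand-picked parameters (not fitted to any data).
STAY = 0.9      # probability that the hidden state stays the same in the next window
CORRECT = 0.8   # probability that the observed call equals the hidden state

def trans(a, b):
    """Transition probability from hidden state a to hidden state b."""
    return STAY if a == b else (1 - STAY) / (len(STATES) - 1)

def emit(s, o):
    """Probability of observing call o when the hidden state is s."""
    return CORRECT if s == o else (1 - CORRECT) / (len(STATES) - 1)

def viterbi(obs):
    """Return the most likely hidden state path for a list of observed calls."""
    n = len(STATES)
    # Log-probability of the best path ending in each state at the first window.
    score = [math.log(1 / n) + math.log(emit(s, obs[0])) for s in STATES]
    back = []
    for o in obs[1:]:
        new_score, pointers = [], []
        for sj in STATES:
            # Best predecessor state for ending in sj at this window.
            best_i = max(range(n),
                         key=lambda i: score[i] + math.log(trans(STATES[i], sj)))
            pointers.append(best_i)
            new_score.append(score[best_i]
                             + math.log(trans(STATES[best_i], sj))
                             + math.log(emit(sj, o)))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    k = max(range(n), key=lambda i: score[i])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    return [STATES[k] for k in reversed(path)]

observed = ["Loss"] * 4 + ["Normal"] + ["Loss"] * 5
decoded = viterbi(observed)
print(decoded)
```

With these assumed numbers, paying for two state changes costs more than explaining a single miscalled window, so the isolated Normal is decoded as Loss.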

Questions-

Do you think it is appropriate to apply an HMM to my data, or have I misunderstood everything and it is not a good idea?

If an HMM will work, can somebody tell me whether I need to change something in the steps above?

Although Baum-Welch will be used for fitting, I have a bad feeling about assigning the probabilities randomly in the beginning. Is that a problem?

P.S: Please let me know if I should post this question on stats stack exchange but I thought it makes more sense to post it here (Biological data + Algorithms).

Thanks in advance,

Vikas

EDIT: If you think this problem can be solved by using some other algo or procedure, please let me know.

hmm cnv • 2.3k views
ADD COMMENTlink modified 7.1 years ago by Qdjm1.9k • written 7.1 years ago by Vikas Bansal2.3k

Somehow my comment never posted this morning. Anyway, I was just wondering: can't you just use a simple machine learning classifier to classify your dataset into those 3 classes?

ADD REPLYlink written 7.1 years ago by Gjain5.3k

Can you please elaborate? It would be great if you could post a small example as an answer showing how you would deal with this problem. (Please see my Edit.)

ADD REPLYlink written 7.1 years ago by Vikas Bansal2.3k

Sure. This paper should give you a brief overview and a good introduction http://www.informatica.si/PDF/31-3/11_Kotsiantis%20-%20Supervised%20Machine%20Learning%20-%20A%20Review%20of...pdf

ADD REPLYlink written 7.1 years ago by Gjain5.3k
7.1 years ago by
Qdjm1.9k
Toronto
Qdjm1.9k wrote:
  1. If your input is already Gain, Loss and Normal, then it's not clear how much more useful an HMM would be than a simple heuristic.
  2. Consider using maximum marginal probability (MMP) to infer Gain, Loss or Normal at each position rather than Viterbi. Viterbi gives you the most likely path through the states; MMP tells you the most likely state at each position.
  3. In general, starting with random parameters is fine. However, in your case, I presume that you want each state to correspond to one of Gain, Loss or Normal. If you initialize randomly, there's no guarantee that this will happen, because HMM training can get stuck in local optima. So I recommend initializing the parameters to point the HMM toward the answer you expect, by setting them based on what you think the final parameter values will be, e.g. initializing the "Gain" state to have a high probability of outputting "Gain" and a small probability of outputting the other two calls. Baum-Welch will refine your initial settings to make them a better match to the data. However, be careful about assigning zero probabilities, because Baum-Welch will keep such a probability equal to zero. Of course, if you think that the zero is appropriate, then use it.
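
The initialization advice can be sketched in plain Python as follows. This is a minimal, hedged sketch of Baum-Welch for a small discrete HMM, not a production implementation: the starting values in `pi`, `A` and `B`, the toy `calls` sequence, and the helper names `forward_backward` and `baum_welch_step` are all assumptions made for illustration. One transition (Gain to Loss) is deliberately started at zero to show that it stays at zero.

```python
import math

STATES = ["Gain", "Loss", "Normal"]  # hidden states; observed calls use the same labels

def forward_backward(obs, pi, A, B):
    """Scaled forward and backward passes; returns alpha, beta and the scaling factors."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    scale = [0.0] * T
    alpha[0] = [pi[i] * B[i][obs[0]] for i in range(n)]
    scale[0] = sum(alpha[0])
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for t in range(1, T):
        alpha[t] = [B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(n))
                    for j in range(n)]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   / scale[t + 1] for i in range(n)]
    return alpha, beta, scale

def baum_welch_step(obs, pi, A, B):
    """One EM update; also returns the log-likelihood under the input parameters."""
    n, T, m = len(pi), len(obs), len(B[0])
    alpha, beta, scale = forward_backward(obs, pi, A, B)
    # gamma[t][i]: posterior probability of state i at window t.
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    new_A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        denom = sum(gamma[t][i] for t in range(T - 1))
        for j in range(n):
            # Expected number of i -> j transitions.
            num = sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                      / scale[t + 1] for t in range(T - 1))
            new_A[i][j] = num / denom
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
              / sum(gamma[t][i] for t in range(T)) for k in range(m)]
             for i in range(n)]
    loglik = sum(math.log(c) for c in scale)
    return gamma[0][:], new_A, new_B, loglik

# Informed starting point (assumed numbers): each state mostly emits its own label.
pi = [1 / 3, 1 / 3, 1 / 3]
A = [[0.90, 0.00, 0.10],   # Gain -> Loss deliberately started at zero
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.90]]
B = [[0.80, 0.10, 0.10],
     [0.10, 0.80, 0.10],
     [0.10, 0.10, 0.80]]

# Made-up toy sequence of observed calls.
calls = (["Loss"] * 4 + ["Normal"] + ["Loss"] * 5 + ["Normal"] * 6
         + ["Gain"] * 5 + ["Loss"] + ["Gain"] * 3 + ["Normal"] * 5)
obs = [STATES.index(c) for c in calls]

logliks = []
for _ in range(10):
    pi, A, B, ll = baum_welch_step(obs, pi, A, B)
    logliks.append(ll)
```

Each update multiplies the expected transition count by the current value of `A[i][j]`, which is why an entry that starts at zero can never become positive.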
ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Qdjm1.9k

Hi, thanks for your reply. I have some questions. "If your input is already Gain, Loss and Normal then it's not clear how much more useful an HMM would be over a simple heuristic" - can you please explain a little why it is not clear?

Can you please provide some good citations for "MMP" (it would be great if they include a comparison with Viterbi)?

ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by Vikas Bansal2.3k
  1. If your observed data already clearly defines the predicted state, then there are no hidden states to learn -- so an HMM-based smoothing of your data is going to be roughly equivalent to a simple rule like: "don't change state unless you see two observations of the new state in a row", depending on how often state changes occur.

  2. You can calculate the marginal distribution probability of the hidden state using forward-backward. It's called "smoothing" in the HMM Wikipedia article. Viterbi computes "the most likely explanation", which is described in the next paragraph in that article.
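
The simple rule from point 1 could look roughly like this in Python. It is a minimal sketch: the function name `smooth_calls` and the `min_run` parameter are made up for illustration, and treating the very first call as trusted is an assumption.

```python
def smooth_calls(calls, min_run=2):
    """Only switch to a new state once it has been seen min_run times in a row.

    Assumption: the first call in the sequence is taken as the starting state.
    """
    if not calls:
        return []
    smoothed = []
    current = calls[0]
    i, n = 0, len(calls)
    while i < n:
        # Find the end of the run of identical calls starting at position i.
        j = i
        while j < n and calls[j] == calls[i]:
            j += 1
        run_len = j - i
        # Accept the new state only if the run is long enough.
        if calls[i] != current and run_len >= min_run:
            current = calls[i]
        smoothed.extend([current] * run_len)
        i = j
    return smoothed

observed = ["Loss"] * 4 + ["Normal"] + ["Loss"] * 5
print(smooth_calls(observed))
```

On the 10-window example from the question, the single Normal call is too short a run to trigger a switch, so it is relabelled as Loss.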

ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by Qdjm1.9k

Thanks for your reply. I will read about this.

ADD REPLYlink written 7.1 years ago by Vikas Bansal2.3k

Hi! I read about smoothing and now understand the difference between the output of Viterbi and smoothing. I still have some confusion, but I think that question is more suitable for Stats Stack Exchange. Just a small question: if I use smoothing, should I run Baum-Welch for fitting first?

ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by Vikas Bansal2.3k

Yes. Smoothing is just a different way of deciding on a hidden state sequence.
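
As a rough illustration of smoothing as a decoding step, the sketch below computes the posterior probability of each state at each window with forward-backward and then picks the most probable state per window. The parameters `STAY` and `CORRECT` are assumed, hand-picked values standing in for whatever Baum-Welch would produce, and the helper names are made up for this example.

```python
STATES = ["Gain", "Loss", "Normal"]

# Assumed parameters standing in for fitted values.
STAY = 0.9      # probability that the hidden state stays the same in the next window
CORRECT = 0.8   # probability that the observed call equals the hidden state

def trans(a, b):
    return STAY if a == b else (1 - STAY) / (len(STATES) - 1)

def emit(s, o):
    return CORRECT if s == o else (1 - CORRECT) / (len(STATES) - 1)

def normalise(values):
    total = sum(values)
    return [v / total for v in values]

def posteriors(obs):
    """Return, for each window, the posterior probability of each hidden state."""
    n = len(STATES)
    # Forward pass, normalised at every step to avoid underflow.
    fwd = [normalise([emit(s, obs[0]) / n for s in STATES])]
    for o in obs[1:]:
        prev = fwd[-1]
        fwd.append(normalise([
            emit(sj, o) * sum(prev[i] * trans(si, sj) for i, si in enumerate(STATES))
            for sj in STATES]))
    # Backward pass, also normalised at every step.
    bwd = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = bwd[0]
        bwd.insert(0, normalise([
            sum(trans(si, sj) * emit(sj, o) * nxt[j] for j, sj in enumerate(STATES))
            for si in STATES]))
    # Combine both passes and renormalise per window.
    return [normalise([f * b for f, b in zip(frow, brow)])
            for frow, brow in zip(fwd, bwd)]

observed = ["Loss"] * 4 + ["Normal"] + ["Loss"] * 5
post = posteriors(observed)
smoothed_calls = [STATES[row.index(max(row))] for row in post]
print(smoothed_calls)
```

Keeping `post` rather than only `smoothed_calls` preserves the per-window uncertainty, which is useful when a window is not clearly assigned to one state.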

ADD REPLYlink written 7.1 years ago by Qdjm1.9k

Thanks. I posted a related question on Stats Stack Exchange here. From the answer - "It is generally not possible to just paste together the most probable states from the marginal conditional distributions to a sequence and claim that the resulting sequence has merits as a sequence.". Any comments?

ADD REPLYlink written 7.1 years ago by Vikas Bansal2.3k

See the last sentence: " I would recommend, if possible, to avoid the hard imputation and work with the conditional distribution of states given emissions as provided by the model. For instance through simulations." The recommendation of NRH is the same as mine. You do want to be careful if the transition matrix has zero probabilities, and "smoothing" only tells you the most likely state at each point, which I assume is closer to what you want than the sequence of hidden states. I think that you've got enough feedback from us -- do a bit of work on your own and figure it out for yourself, that's what grad school is for.

ADD REPLYlink written 7.1 years ago by Qdjm1.9k

I am working on it but sometimes get confused. Thanks a lot for your help.

ADD REPLYlink written 7.1 years ago by Vikas Bansal2.3k