Why Doesn'T The "Markov Property" Disqualify Markov Chains For Sequence Analysis?
4
7
Entering edit mode
12.6 years ago
Ib84 ▴ 100

Sometimes, to me as a biologist, the asumptions that are at the basis of established bioinformatic methods seem inadequately oversimplifying the biological mechanisms that they try to model. I'm currently studying Markov chains, which seem to be used in several fields, also for sequence analysis, each nucleotide or amino acid being a "state". Compared to FSM, bayesian nets and petri nets, they have a very restrictive assumption: the markov property says that a state depends only on the previous state. I fail to see how this is acceptable in biology at all! The most evident incompatibility: How can any markov chain still be applicable to nucleotides, considering the nature of triplets, i.e. degeneracy of the genetic code? The transition probabilities from first to second base must be completely different than the transition from the second to the third!? Furthermore, the third base also depends on the first, not only the second. The only use I can see for markov chains, is for the "random sequence model", against which probability of a given sequence is compared, but even for the above would render it a bad random sequence model. I haven't understood yet, what else markov chains could be used for...

Further points are motifs, repeated sequences, hydrophobic vs hydrophile regions in a protein etc. The least important thing the occurence of a given nucleotide/aa in a sequence depends on, is its preceding nucleotide/aa...

thanks for some explanation

nucleotide sequence • 5.6k views
ADD COMMENT
0
Entering edit mode

in the little studying that I've done .. it seems the most applicable HMM comparison is between the assembled genome .. and the reference genome? ... but I'd really like someone to point out whats wrong about this...

ADD REPLY
9
Entering edit mode
12.6 years ago
Farhat ★ 2.9k

What you are describing is an order-1 Markov model, where the probability of a next base depends only on the base preceding it. This can be generalized to an order-k Markov model where the probability of observing a particular nucleotide at a given position along a sequence depends only on the previous k nucleotides. Thus, features like the triplet code, repeats etc. can be accounted for.

ADD COMMENT
2
Entering edit mode

Its a typical error. In First-Order Markov-Chains, a state does not only depend on the precursor-state. States are independent given there precursor state. That makes a mathematical difference.

ADD REPLY
2
Entering edit mode

There is nothing wrong with his answer. This is typical of HM and the observed is definitely dependent on the hidden state. When designing a model, the diffulty lies in identifying the hidden states and what 'order' is appropriate.

ADD REPLY
1
Entering edit mode

It is wrong, because it's imprecise. It's a typical question our professors ask in assignments. I really don't know how this opinion has spread so wide. First-Order Markov Models are able to model higher dependencies. A simple test: train a WAM and look at the mutual informations.

ADD REPLY
1
Entering edit mode

That's not true. You can ALWAYS transform k-order Markov model to 1-order Markov model. You might just end up with much higher number of states.

ADD REPLY
0
Entering edit mode

ah ok. They should say this in my books more clearly / at the right place. thanks Farhat

ADD REPLY
0
Entering edit mode

most do, what books are you reading?

ADD REPLY
4
Entering edit mode
12.6 years ago
Paul_Muller ▴ 70

A very prominent computational biologist once told me that "all models are wrong, some are useful and most are misleading". It is the grain of salt in every analysis I perform. This certainly doesn't answer your question, but it is worth remembering.

ADD COMMENT
0
Entering edit mode

Who heard it from someone who heard it from someone who hear it from http://en.wikiquote.org/wiki/George_E._P._Box. :)

ADD REPLY
0
Entering edit mode

@Paul & Daniel: hehe :-) on that link they also say: "the practical question is how wrong do they have to be to not be useful.":-)

ADD REPLY
4
Entering edit mode
12.6 years ago
Fabian Bull ★ 1.3k

Yes, using Markov Models we do make assumptions. This is mainly for two reasons:

  1. We have to do assumptions otherwise we wouldn't go anywhere.
  2. We introduce them to make computations feasible.

Some of the assumptions may be simplified but your statement of independence to all precursors except the last is wrong (for First-Order Markov Models included).

This is a common mistake even computational biologists do. It's the difference between independence and conditional independence. As an example, if A depends on B and B depends on C, then A will depend on C.

Surely First-Order Markov Models capture most of this dependency but not all. But with increasing order Markov Models do not increase their power. The highest order I ever see used was 5.

But you are right in the fact that there are more things to consider and this is of topic current research.

ADD COMMENT
2
Entering edit mode

The difference is kind of hart to figure out (Thats why most people fail) but Ill give it a try. First-Order Markov-Chain means: All states are independent GIVEN their precursor. Thats different to all states are independent except a state and its precursor. Thats why Markov-Chains are able to model dependencies longer then 1bp. As I said: If A depends on B and B depends on C then surely A depends on C. Surely Higher-Order Markov-Modells are better for modelling high dependecies but its wrong to say that First-Order Modells dont do that too. Any statistical that will say the same.

ADD REPLY
0
Entering edit mode

@peri4n. Hello. a) I'm "studying" clearly translates to "I don't claim to have much statistical knowledge." b) I'd appreciate if you could explain the difference between independance and conditional independence clearly. c) from the biological perspective it's pretty sure that the "transition probabilities" of the 2nd to 3rd nucleotide are most certainly not identical in all cases and not independant of the first nucleotide in the triplet. Please explain me then, how a 1st-order markov chain should reflect that?

ADD REPLY
0
Entering edit mode

In his paper (http://www.ncbi.nlm.nih.gov/pubmed/9228612) Henderson et al. describes VEIL system which being based on HMM handles codon position dependence using first-order Markov chains.

ADD REPLY
3
Entering edit mode
12.6 years ago

While Farhat's answer is correct, you still raise an important point. Yes, you can use higher-order Markov models to handle trinucleotides or other basic structure in DNA, but what about promoter regions upstream of a transcription start site? Intron splice junctions? DNA does violate a lot of these independence assumptions. Also consider the assumption (in HMMs) that one output is independent from another and depends only on the state. Considering codon biases, evolutionary pressures, etc, it's safe to say the DNA blows this assumption out of the water too.

Perhaps the answer to your concerns is that despite violating some of the independence assumptions, modeling DNA using Markov models (in general) and HMMs (in particular) still can be (and has been) informative and successful under certain circumstances.

ADD COMMENT
1
Entering edit mode

@lb84 My experience with Markov models is also limited. I assume that sequences with a lot of underlying (and unmodeled) structure would be the most problematic, but of course it's really hard to tell how well your model is doing unless you already have the right answer. I don't think this should discourage anyone from using HMMs when they can, people just need to make sure their assumptions are stated clearly and that the results need to be taken with a grain of salt.

ADD REPLY
0
Entering edit mode

hi daniel, thanks for your answer. I don't know enough to understand how higher-order markov models can still work although assumptions are not fit. There must be some consequences of those violations? Are the results less reliable, i.e. sometimes wrong?

ADD REPLY

Login before adding your answer.

Traffic: 1462 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6