What's the point of estimating the background distribution in probabilistic motif-finding models such as MEME (mixture model with EM) or an HMM?
2.3 years ago
guixien • 0

I suppose that when searching a database for homologous sequences with a profile built by either MEME (represented as a PWM) or an HMM (represented as a profile HMM), the score is a log-odds ratio in which the random model is an independent process whose emission probabilities follow the discrete uniform distribution. If so, what is the point of these probabilistic models estimating the background emission probabilities during training?

Or should I say that the valuable part of MEME and HMMs is that they use expectation maximization, so that the likelihood is guaranteed not to decrease from one iteration to the next (as opposed to some heuristic method)?

I also noticed that most HMM packages (e.g. HMMER3, as opposed to HMMER2) train the HMM from a multiple sequence alignment instead of using Baum-Welch. Doing so, of course, avoids estimating the background probabilities.

If anyone could provide the big picture, I'd appreciate it.

motif sequence analysis sequence homology • 579 views
2.3 years ago
Mensur Dlakic ★ 15k

The background distribution of residues affects the scoring.

Score(i) = Sum over b of [ Pb(i) * log2( Pb(i) / P0b ) ]

where b = A, C, G, T (for DNA), Pb(i) is the frequency of residue b at position i of the motif, and P0b is the background frequency of residue b. If the sampled frequency of A is 0.35, the motif scanning score will be quite different when the background frequency of A is 0.25 than when it is 0.35: in the latter case the A term is exactly zero, because log2(1) = 0. The same is true for HMMs, although most HMMs use amino acid frequencies estimated from large protein databases rather than from the proteins of an individual organism or a group of related organisms.
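To make the effect concrete, here is a minimal Python sketch of the per-position score above, evaluated for one motif column against two different backgrounds. The column frequencies, the AT-rich background, and the function name `position_score` are made up for illustration; only the 0.35 vs. 0.25 case for A comes from the example above. This is not how MEME or HMMER compute scores internally, just the formula written out.

```python
import math

def position_score(probs, background):
    """Sum over residues b of P_b(i) * log2(P_b(i) / P0_b), in bits."""
    return sum(
        p * math.log2(p / background[b])
        for b, p in probs.items()
        if p > 0  # a residue with zero frequency contributes nothing
    )

# Hypothetical motif column (illustrative values; A is observed at 0.35).
column = {"A": 0.35, "C": 0.25, "G": 0.20, "T": 0.20}

# Two assumed backgrounds: uniform, and an AT-rich genome where P0(A) = 0.35.
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich_bg = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}

score_uniform = position_score(column, uniform_bg)
score_at_rich = position_score(column, at_rich_bg)

# Contribution of A alone under each background.
a_term_uniform = column["A"] * math.log2(column["A"] / uniform_bg["A"])
a_term_at_rich = column["A"] * math.log2(column["A"] / at_rich_bg["A"])

print(f"column score, uniform background: {score_uniform:.4f} bits")
print(f"column score, AT-rich background: {score_at_rich:.4f} bits")
print(f"A term, uniform background:       {a_term_uniform:.4f} bits")
print(f"A term, AT-rich background:       {a_term_at_rich:.4f} bits")
```

The same column gets a different score under each background, and the A term vanishes when the background frequency of A matches the observed one, which is why the choice of background matters at search time.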


Yes. I am guessing that's the reason why HMMER3 does not include a Baum-Welch training procedure to train a profile HMM, since eventually the estimated background distribution during the training is going to be "discarded" anyway when performing a homology search in a database?


I don't know the exact answer, as I haven't gone through the HMMER code, but I don't think the global background frequencies are discarded during the search.

These papers may have the answer you are looking for:

https://www.ncbi.nlm.nih.gov/pubmed/20180275

https://www.ncbi.nlm.nih.gov/pubmed/22039361