Question

Can nhmmer predict short DNA (~20 bp) motifs from HMM profiles?

4.0 years ago

Taylor • 0

Hi Everyone,

Background: I am working on a project where I want to predict putative short transcription regulatory motifs (~20 bp of DNA). I do not typically work with sequences this short, nor do I have a good sense of how people tend to predict short DNA motifs (if at all). Nonetheless, my inclination is that HMMER (I am using v3.1b2) might be suitable for the task. To do this, I generated a multiple sequence alignment using a combination of sequences from previously published work and some sequences a colleague provided. The alignment indicates a conserved CTG feature (mean frequencies 83%, 100%, 83%), followed by a moderately conserved CAG feature (mean frequencies 50%, 50%, and 75%, respectively), and finally a more conserved CAG feature (mean frequencies 92%, 75%, 92%). The CTG and second CAG features are consistent with what was previously reported for this particular transcription regulatory motif.

Issue: After generating the HMM profile using hmmbuild (default parameters), I try predicting the motif using nhmmer (default parameters) on the original DNA sequences used to generate the profile, the idea being that all (or most) of the sequences should return as positive hits. As it turns out, none of them do. Inspecting the output file suggests that the forward parsing filter removes most sequences; however, using the --max flag with nhmmer (which, to my understanding, removes the heuristic filters and returns everything meeting the E-value threshold) yields only a few hits with unimpressive E-values (~0.1 to 1).

Question: My question is twofold. Can nhmmer feasibly predict DNA sequence motifs as short as 20 bp? If my understanding of HMMER is correct, the significance of a hit is determined by how well an HMM profile aligns with the query sequence relative to a null model (I think the DNA frequencies are based on Swiss-Prot genes). I can imagine that, since the conserved regions are so short, the likelihoods under the profile and under the random model are too similar to confidently return a positive match, but I do not know this for certain. If nhmmer is theoretically sensitive enough to detect such a short DNA motif, how can I alter the HMMER parameters to improve my predictions?
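To put a rough number on the "too comparable in likelihoods" worry, here is a minimal sketch of the best-case log-odds score that the nine conserved columns could contribute. It uses the frequencies quoted above and assumes a uniform 25% background per base, which is an assumption of this sketch and not necessarily the null model nhmmer uses. It also ignores the priors hmmbuild applies and the unconserved columns, so treat it as an upper-bound illustration only.

```python
import math

# Frequency of the consensus base in each conserved column, as quoted
# in the question: CTG, then CAG, then CAG.
consensus_freqs = [0.83, 1.00, 0.83, 0.50, 0.50, 0.75, 0.92, 0.75, 0.92]

# Assumption for this sketch: uniform background (each base at 25%).
BACKGROUND = 0.25


def log_odds_bits(p, q=BACKGROUND):
    """Log-odds score in bits for observing the consensus base."""
    return math.log2(p / q)


# Score each column as if the target matched the consensus base exactly.
column_scores = [log_odds_bits(p) for p in consensus_freqs]
best_case_bits = sum(column_scores)

# A perfectly conserved DNA column can contribute at most 2 bits
# under a uniform background.
ceiling_bits = 2.0 * len(consensus_freqs)

print(f"best-case score from conserved columns: {best_case_bits:.1f} bits")
print(f"ceiling for {len(consensus_freqs)} perfect columns: {ceiling_bits:.1f} bits")
```

Even a perfect match to these columns has a fixed, fairly small score budget. How that translates into an E-value depends on the size of the search space and on how the tool calibrates its scores, which this sketch does not model.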

Thank you for any input.

HMMER nhmmer • 1.3k views

updated 4.0 years ago by Mensur Dlakic ★ 29k • written 4.0 years ago by Taylor • 0

score 1 · Answer 1 · 2021-07-13

You are correct: nhmmer is not meant for short sequences. I don't think you should spend time correcting the HMMER code, because there are numerous motif-building tools that are meant for short sequences. I suggest the MEME Suite, as it has tools to detect, scan, and score short nucleic acid motifs. A simple pipeline would be: use MEME or GLAM2 to detect motifs; FIMO or MAST to scan a sequence for the presence of these motifs; and TOMTOM to compare them with known motifs.
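To make the "scan and score" step concrete, here is a minimal sketch of position weight matrix scanning on both strands. It is a simplified stand-in for what motif scanners do conceptually, not FIMO's or MAST's actual algorithm (no p-values, for instance). The aligned sites, the target sequence, the pseudocount, the uniform background, and the threshold are all made up for illustration.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds (bits) matrix from equal-length aligned sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def scan(seq, pwm, threshold):
    """Score every window on both strands; keep windows >= threshold.

    Start positions on the '-' strand are relative to the
    reverse-complemented sequence. Non-ACGT bases score 0.
    """
    width = len(pwm)
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for start in range(len(s) - width + 1):
            window = s[start:start + width]
            score = sum(col.get(base, 0.0) for col, base in zip(pwm, window))
            if score >= threshold:
                hits.append((strand, start, window, round(score, 2)))
    return hits


# Hypothetical aligned sites and target, for illustration only.
sites = ["CTGTAT", "CTGTAC", "CTGAAT", "CTGTAT"]
pwm = build_pwm(sites)
hits = scan("GGGCTGTATCCC", pwm, threshold=5.0)
for hit in hits:
    print(hit)
```

In practice the MEME Suite tools also attach a statistical significance to each hit, which is the part worth relying on for real data.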

You keep saying "predict motifs" where I assume you mean "score motifs" because there is nothing to predict here.

The two conserved mini-motifs you found (CTG & CAG) are reverse complements of each other. Depending on their spacing (say, 10-12 bp from the middle bases), you may have a globally palindromic motif. For example, the famous CAP/CRP transcription regulator has a somewhat similar motif where conserved regions are TGTG & CACA. LexA repressor has those same CTG & CAG mini-motifs separated by approximately a DNA turn.
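The reverse-complement relationship is easy to check for yourself. The sketch below confirms that CTG and CAG are reverse complements and measures how palindromic a candidate site is overall. The 16 bp sequence is only an illustrative example with CTG at one end and CAG at the other, not one of your sites.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def palindrome_fraction(seq):
    """Fraction of positions where seq agrees with its reverse complement.

    1.0 means the site is a perfect reverse-complement palindrome.
    """
    rc = reverse_complement(seq)
    matches = sum(1 for a, b in zip(seq, rc) if a == b)
    return matches / len(seq)


# The two mini-motifs are reverse complements of each other.
ctg_rc = reverse_complement("CTG")
print("reverse complement of CTG:", ctg_rc)

# Illustrative site (an assumption for this sketch).
example_site = "CTGTATATATATACAG"
fraction = palindrome_fraction(example_site)
print(f"palindromic fraction of {example_site}: {fraction:.2f}")
```

Running `palindrome_fraction` over each sequence in your alignment would show whether the symmetry extends beyond the three-base ends.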
