How do I get higher scores for my sequences when using the HMMER package?
2.8 years ago
howenwy2 • 0

Hi, I have recently been using the HMMER package to predict binding sites in a sequence. Here is my input data:

CLUSTAL O(1.2.4) multiple sequence alignment

gene1 TTTGAGTGTGTTA 13

gene2 TTTGATCTGGTTA 13

gene3 ATTGAGGTAGTTA 13

gene4 TTTGAGGCTATTG 13

I need to find the motif (T/A)TTGANNNNNTT(G/A) in my genome sequence. gene1 to gene4 are sequences from the same genome, and I need to find this motif in the other genes of that genome.

So far I can find about 38 binding sites in my target genome; however, the scores are low. Is there a way to increase them?

HMMER hidden markov model sequence analysis • 558 views

Why?

The score reflects how similar those sequences are. They cannot be made ‘more similar’. What do you want to gain from higher scores?


Thank you so much for replying. Can I trust scores lower than 10 (or lower than 5)? I got several genes with lower scores, but those genes contain the specific sequence that I need.


As jrj.healey said: the score is what it is, and you can't change that. Given your input data, this is far from unexpected: that is a very broad "motif" you are looking for, and you are screening a whole genome with it.

The only thing you can do is provide better (i.e. more specific) input data.
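To put a rough number on how broad the motif is, here is a back-of-the-envelope sketch. It assumes uniform, independent base frequencies (0.25 each), which real genomes only approximate, and the 5 Mb genome length in the example is purely hypothetical since the thread does not state one. The motif (T/A)TTGANNNNNTT(G/A) has 6 fixed positions, 2 two-fold positions and 5 free positions.

```python
def motif_match_probability(fixed=6, two_fold=2):
    """Chance that a random 13-mer matches (T/A)TTGANNNNNTT(G/A).

    Assumes each base occurs independently with probability 0.25.
    Fixed positions contribute 1/4 each, two-fold positions 1/2 each,
    and the five N positions contribute nothing.
    """
    return (0.25 ** fixed) * (0.5 ** two_fold)


def expected_chance_hits(genome_length, motif_length=13, both_strands=True):
    """Expected number of chance matches when scanning a random genome."""
    windows = max(genome_length - motif_length + 1, 0)
    strands = 2 if both_strands else 1
    return strands * windows * motif_match_probability()


if __name__ == "__main__":
    # Hypothetical 5 Mb genome, scanned on both strands.
    print(round(expected_chance_hits(5_000_000)))
```

Under these assumptions a random 13-mer matches with probability 1/16384, so a genome of a few megabases is expected to contain hundreds of matches by chance alone, which is consistent with low scores for individual hits.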


Thank you so much for replying. Can I trust scores lower than 10 (or lower than 5)? I got several genes with lower scores, but those genes contain the specific sequence that I need.


My gut feeling is no (and I personally wouldn't trust them either), but do check the HMMER docs to see if there is any advice on score interpretation.

Alternatively, you could check how InterPro (InterProScan) deals with this. They use empirically determined thresholds to decide between match and no match.


You’d have to look at how the score is defined in the docs (I don’t know this off the top of my head) and pick a reasonable-sounding number.

Binding sites are notoriously ‘wonky’ and hard to predict, so you would be justified in considering lower-than-typical scores perhaps.

If the score is anything like an E-value, a value of 5 would imply that about 5 matches of that quality would be expected by pure chance alone, so that’s probably too high.

Don’t make the mistake of only using the numbers though. If you detect a match, and its genomic context looks valid (it’s in the right place, adjacent to a gene, etc.), then there’s grounds to proceed. In short, throw some intuition at the problem; don’t just take those scores blindly.

To answer your original question: you may be able to improve your HMM scores by using an HMM built from more known examples of the binding site. This will allow for a more ‘informed’ HMM and may help to narrow down your hits.
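To illustrate why the training examples bound the score, here is a minimal sketch of a position-specific log-odds model built from the four sequences in the question. This is not HMMER's algorithm (there are no insert or delete states and no HMMER-style priors); the pseudocount of 1 and the uniform 0.25 background are assumptions chosen for illustration.

```python
import math
from collections import Counter

# The four aligned binding sites from the question.
ALIGNMENT = [
    "TTTGAGTGTGTTA",
    "TTTGATCTGGTTA",
    "ATTGAGGTAGTTA",
    "TTTGAGGCTATTG",
]


def build_log_odds(seqs, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds} dict per alignment column."""
    total = len(seqs) + 4 * pseudocount
    matrix = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        matrix.append({
            base: math.log2((counts[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix


def score(matrix, seq):
    """Sum of per-position log-odds, in bits, for an ungapped sequence."""
    return sum(column[base] for column, base in zip(matrix, seq))


if __name__ == "__main__":
    model = build_log_odds(ALIGNMENT)
    for s in ALIGNMENT:
        print(s, round(score(model, s), 2))
```

With only four training sequences the pseudocounts dominate, so even a perfectly conserved column contributes well under the 2-bit maximum. More genuine examples of the site would sharpen the conserved columns, which is the effect described above.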

Alternatively, there isn’t strictly any need to use HMMs for this at all. Since your binding site is pretty well defined, you could just use fuzzy nucleotide matching, e.g. via EMBOSS’s fuzznuc.
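As a rough stand-in for that kind of pattern search, here is a short Python sketch using a regular expression. It is not fuzznuc itself and allows no mismatches. The function names are made up for illustration, and it assumes the sequence contains only A, C, G and T.

```python
import re

# (T/A)TTGANNNNNTT(G/A); the lookahead lets overlapping matches be reported.
MOTIF = re.compile(r"(?=([TA]TTGA[ACGT]{5}TT[GA]))")
MOTIF_LENGTH = 13
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq):
    """Return (start, strand, matched_sequence) for every motif hit.

    start is 0-based on the forward strand; minus-strand hits report
    the matched sequence as read on the minus strand.
    """
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in MOTIF.finditer(seq)]
    rc = reverse_complement(seq)
    for m in MOTIF.finditer(rc):
        start = len(seq) - (m.start() + MOTIF_LENGTH)
        hits.append((start, "-", m.group(1)))
    return hits


if __name__ == "__main__":
    for name, s in [("gene1", "TTTGAGTGTGTTA"), ("gene2", "TTTGATCTGGTTA")]:
        print(name, find_motif(s))
```

Every hit from a search like this matches the pattern exactly, so the question of trusting low scores goes away; the remaining work is checking genomic context, as suggested above.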