Question: E-value results from hmmsearch are not accurate. [HMMER]
23 months ago by
danzanzu0
danzanzu0 wrote:

Good Afternoon,

I am using the HMMER tools to analyze and better understand a DNA sequence dataset, referred to as the H3 dataset, which contains DNA sequences from both class 0 and class 1.

The following is what I did:

  • Acquired 7000 class 0 DNA sequences from the H3 dataset and built an HMM profile from 80% of them (5600 sequences). (screenshots: fastas, profile)

  • Next, I took 20% of the DNA sequences from each of class 0 and class 1. (screenshots: fastas2, fastas3)

  • I then ran hmmsearch with the previously built HMM profile against each of the two 20% sets. I expected the search on the class 0 sequences to report many sequences above the inclusion threshold, with many E-values close to 0. (screenshot: hmmsearch)

This resulted in the output file below. (screenshot: test)

Why were no targets detected? I thought I had done something wrong up to this point, so I performed one last experiment.

  • I ran hmmsearch with the HMM profile against the very same sequences that were used to build it. Since they are exactly the same sequences, there should be matches. The resulting output is below:

(screenshots: test2, test3)

Once again, no hits were detected.

So my final question is: am I using hmmbuild and hmmsearch correctly, and how can I improve the results? It is extremely strange that I am comparing the exact same sequences and getting no hits. Any help would be appreciated.

Thanks.

sequencing sequence • 1.6k views
modified 23 months ago by Shyam130 • written 23 months ago by danzanzu0

In the output of hmmbuild, "eff_nseq" is too high and "re/pos" is too low. I think that is because your input multi-FASTA file is not aligned, so the resulting HMM is nonsense and therefore does not hit any sequence.
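
A quick sanity check along these lines is to confirm that every record in the multi-FASTA file has the same length once gap characters are counted. The sketch below is only an illustration in Python; the function names are made up for this example and are not part of HMMER. Note that equal length is necessary but not sufficient: raw sequences that all happen to be the same length would pass this check without being aligned.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text lines into {header: sequence} (duplicate headers overwrite)."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def looks_aligned(records: Dict[str, str]) -> bool:
    """True if there is at least one record and all records share one length."""
    lengths = {len(seq) for seq in records.values()}
    return len(lengths) == 1


unaligned = read_fasta([">a", "ACGT", ">b", "ACGTAC"])
aligned = read_fasta([">a", "ACGT--", ">b", "ACGTAC"])
print(looks_aligned(unaligned), looks_aligned(aligned))  # False True
```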

modified 23 months ago • written 23 months ago by fishgolden410

I agree with that. Looks like the input is a random alignment. The title of the question is misleading.

written 23 months ago by cryptogenomicon150
23 months ago by
Shyam130
United States
Shyam130 wrote:

To build an HMM from sequences, you first need to make a multiple sequence alignment and use that alignment, in FASTA format, as input for hmmbuild. If you didn't align the sequences, there is no way the resulting profile can detect any homology when you search back against the input sequences. Hope this answers your question.
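
The order of steps described here (align, build, search) can be sketched as a list of command lines. This is a rough illustration in Python, not a tested pipeline: the file names are hypothetical, and the MUSCLE arguments shown are the v3 form (`-in`/`-out`), which is an assumption about the installed version since MUSCLE v5 uses `-align`/`-output` instead.

```python
import subprocess
from typing import List


def pipeline_commands(train_fasta: str, search_fasta: str, prefix: str) -> List[List[str]]:
    """Return the align -> build -> search command lines (file names are hypothetical)."""
    aligned = f"{prefix}.afa"
    hmm = f"{prefix}.hmm"
    return [
        # Assumes MUSCLE v3 syntax; v5 uses "-align" and "-output".
        ["muscle", "-in", train_fasta, "-out", aligned],
        # hmmbuild takes the output profile first, then the alignment.
        ["hmmbuild", "--dna", hmm, aligned],
        # hmmsearch takes the profile first, then the sequences to search.
        ["hmmsearch", hmm, search_fasta],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each step in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


for argv in pipeline_commands("train.fa", "test.fa", "h3"):
    print(" ".join(argv))
```

Calling `run_pipeline(...)` on the returned list would execute the steps, provided the tools are installed and on the PATH.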

written 23 months ago by Shyam130

I thought about what you said. So I downloaded the MUSCLE multiple sequence alignment tool to align the sequences, so that the output could be fed to hmmbuild to form a meaningful HMM profile.

This is the aligned output from the MUSCLE program. (screenshot: test)

I then fed this aligned output to HMMER to build the profile in the following manner: (screenshot: test2)

However, even after aligning the sequences, the NSEQ is still too high and the re/pos is still too small, resulting in no hits.

written 23 months ago by danzanzu0

Are you sure that the nucleotide entries in your FASTA file are from the same family, superfamily, or fold that you want to build a profile for? I googled the names of the entries and found that they have various descriptions, so I think they must belong to different families (SUL1 and SUL2 might belong to the same one). Building a profile from entries that have no evolutionary relationship is also nonsense (most of the time).

Or the sequences do have evolutionary relationships, but if they are too diverged, HMM construction will fail, too. In that case, however, eff_nseq would be lower.

(Also, when you use DNA, be careful about the direction of the strands.)
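
To make the strand point concrete: the same region read from the opposite strand is the reverse complement, which looks like a completely different string. A minimal Python sketch follows; the helper name is illustrative only, and characters other than A/C/G/T (such as N or gap symbols) are simply left unchanged.

```python
# Translation table for the four standard bases, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the sequence as read 5'->3' on the opposite strand."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGCC"))  # GGCAT
```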

modified 23 months ago • written 23 months ago by fishgolden410

The nucleotide entries were taken from a research paper and can be found here, so that you can get a better understanding of the data: http://www.jaist.ac.jp/~tran/nucleosome/members.htm

In the above example, I took a random 80% of the H3 negative-class DNA sequences and built a multiple sequence alignment file using MUSCLE; the building of the HMM profile is shown above.
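
For reference, a random 80/20 split of this kind can be sketched in a few lines of Python. This is an assumed illustration, not the exact procedure used here; the function name, the fixed seed, and the use of integers as stand-ins for sequence records are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_records(
    records: Sequence[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the records and split it into (train, held-out) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, held_out = split_records(list(range(7000)))
print(len(train), len(held_out))  # 5600 1400
```

Keeping the held-out 20% disjoint from the 80% used to build the profile is what makes the later search a fair test.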

I am still stuck trying to build a good HMM profile with hmmbuild, since I am a bit of a beginner.

Any suggestions on where to continue?

written 23 months ago by danzanzu0

You are using the "negative" dataset? Does that mean your hypothesis is that there are histone-avoiding motifs in the dataset, and that you were going to model a histone-avoiding motif with an HMM? That is very interesting.

I'm not a nucleotide person and have not used HMMER or MUSCLE much, so the following comments are based on my limited knowledge, but they are very general, I think.

(Please correct me if I am wrong, somebody)

Problems:

  • The dataset you are using contains sequences (entries) which do not have evolutionary relationships (because it is the result of ChIP-chip).
  • ChIP-chip data contains not only histone-binding motifs but also unrelated regions around them.

As I mentioned in my previous comment, an HMM made from such unrelated sequences is nonsense. I want to add one exception, though: even when sequences have no evolutionary relationship, you can sometimes build an HMM if they share some strong pattern, such as signal peptides or transmembrane regions. However, I don't think histone-binding or histone-avoiding motifs have such a strong pattern.

  • HMMER and MUSCLE, a sequence searcher and an aligner, are designed to find and align evolutionarily related sequences or regions.

Such evolutionarily related regions are independent (I think) of the histone-avoiding motif. Therefore, the histone-avoiding motif will be corrupted in the resulting alignment. (Also, if you did successfully make an HMM of the histone-avoiding motif, the motif would be widely distributed in the genome, so the E-values might become very high. But I'm not sure; this is mere conjecture.)

But the idea is interesting.

The normal HMM build pipeline failed... The author of the dataset you referred to is using k-gram+SVM?... Then could we construct a k-gram HMM??? Or cluster related sequences and make a separate HMM for each... Hmmmmmm..... I think it requires further investigation of published papers (maybe someone has tried) and very deep discussion.
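
As a rough illustration of what a k-gram representation looks like, assuming "k-gram" here means overlapping length-k substrings (k-mers), the Python sketch below counts them for one sequence. It only shows the feature-counting step; it is not the dataset authors' actual method, and the function name is made up for this example.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping length-k substrings of a sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = kmer_counts("ACGACG", 3)
print(counts["ACG"], counts["CGA"], counts["GAC"])  # 2 1 1
```

Counts like these do not depend on aligning the sequences, which is one reason such features can be used on data with no evolutionary relationship.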

I think you should discuss this with your supervisor or someone who has a lot of knowledge about this field.

modified 22 months ago • written 22 months ago by fishgolden410