Question: Why is hmmsearch only using the 1st HMM in a concatenated HMM database?
0
gravatar for O.rka
8 weeks ago by
O.rka170
O.rka170 wrote:

I'm running hmmsearch via HMMER 3.3 with a concatenated HMM file. Here is the first 1000 lines of my tigrfams.hmm.gz. It looks like the file is properly merged with the following format:

[HMM]
//
[HMM]
//

I don't know why but I'm only getting a hit for the first HMM TIGR00001 and I have over a million protein sequences. I know there's definitely other hits but it's running quickly. I ran the same type of command on a concatenated PFAM db and KOFAM db...both worked :/ However, when I ran a concatenated PANTHER db I got a similar result as my TIGRFAM (only one hit).

Here is my command (it's an arrayjob for sungrid engine):

#!/bin/bash
###########
# Variables
###########
__INPUT=tigrfam_output/array_output/sequences/partition_${SGE_TASK_ID}.fasta.gz
__OUTPUT=tigrfam_output/array_output/output/hmmsearch.tigrfam_${SGE_TASK_ID}.tblout
__THREADS=1
#########
# Command
#########
time(hmmsearch -o /dev/null --tblout $__OUTPUT --cut_ga --cpu $__THREADS /usr/local/scratch/METAGENOMICS/jespinoz/db/tigrfam_db/tigrfams.hmm.gz $__INPUT)

What am I doing incorrectly?

ADD COMMENTlink modified 8 weeks ago by Mensur Dlakic5.5k • written 8 weeks ago by O.rka170
2
gravatar for Mensur Dlakic
8 weeks ago by
Mensur Dlakic5.5k
USA
Mensur Dlakic5.5k wrote:

hmmsearch is meant for searching single HMMs at a time against a database of sequences. If you give it more than one HMM, for a large sequence database it will be a while before it gets to print out the matches to subsequent HMMs.

What each program in HMMer suite is doing is explained in its documentation.

ADD COMMENTlink written 8 weeks ago by Mensur Dlakic5.5k

Description hmmsearch is used to search one or more profiles against a sequence database. For each profile in hmmfile, use that query profile to search the target database of sequences in seqdb, and output ranked lists of the sequences with the most significant matches to the profile. To build profiles from multiple alignments, see hmmbuild. I interpreted this as use for multiple sequences. My sequence database is quite large so what you mentioned above could be the issue. It looks like hmmscan will be a better option and it seems like anvi'o also updated to use hmmscan.

What confuses me the most is that I used hmmsearch for the PFAM database and it seemed to work great? What could be different about the TIGRFAM database? They look the same when I open up the merged files.

ADD REPLYlink written 8 weeks ago by O.rka170

Not sure what the problem is, other than any HMM search will be slow on a large database when using a single thread. Also, nothing is wrong with TIGRFAMs database either.

It may be a good idea to troubleshoot it on a smaller database as you will get the answer faster. I just ran hmmsearch with a whole TIGRFAMs database against PDB (~110K sequences). The output was as expected, with all HMMs having hits. That was using 3 CPUs and still it took 32 minutes. I am sure you can do the math how long it would take with a single CPU and 100-200x larger database.

ADD REPLYlink written 8 weeks ago by Mensur Dlakic5.5k

Thanks again, I'll run some test examples right now.

ADD REPLYlink written 8 weeks ago by O.rka170
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1520 users visited in the last hour