Question

Why is hmmsearch only using the 1st HMM in a concatenated HMM database?

0

Entering edit mode

4.0 years ago

O.rka ▴ 710

I'm running hmmsearch via HMMER 3.3 with a concatenated HMM file. Here is the first 1000 lines of my tigrfams.hmm.gz. It looks like the file is properly merged with the following format:

[HMM]
//
[HMM]
//

I don't know why but I'm only getting a hit for the first HMM TIGR00001 and I have over a million protein sequences. I know there's definitely other hits but it's running quickly. I ran the same type of command on a concatenated PFAM db and KOFAM db...both worked :/ However, when I ran a concatenated PANTHER db I got a similar result as my TIGRFAM (only one hit).

Here is my command (it's an arrayjob for sungrid engine):

#!/bin/bash
###########
# Variables
###########
__INPUT=tigrfam_output/array_output/sequences/partition_${SGE_TASK_ID}.fasta.gz
__OUTPUT=tigrfam_output/array_output/output/hmmsearch.tigrfam_${SGE_TASK_ID}.tblout
__THREADS=1
#########
# Command
#########
time(hmmsearch -o /dev/null --tblout $__OUTPUT --cut_ga --cpu $__THREADS /usr/local/scratch/METAGENOMICS/jespinoz/db/tigrfam_db/tigrfams.hmm.gz $__INPUT)

What am I doing incorrectly?

hmm hmmer protein domain bash script • 2.2k views

ADD COMMENT • link updated 4.0 years ago by Mensur Dlakic ★ 27k • written 4.0 years ago by O.rka ▴ 710

score 2 · Accepted Answer · 2020-04-05

2

Entering edit mode

4.0 years ago

Mensur Dlakic ★ 27k

hmmsearch is meant for searching single HMMs at a time against a database of sequences. If you give it more than one HMM, for a large sequence database it will be a while before it gets to print out the matches to subsequent HMMs.

What each program in HMMer suite is doing is explained in its documentation.

ADD COMMENT • link 4.0 years ago by Mensur Dlakic ★ 27k

0

Entering edit mode

Description hmmsearch is used to search one or more profiles against a sequence database. For each profile in hmmfile, use that query profile to search the target database of sequences in seqdb, and output ranked lists of the sequences with the most significant matches to the profile. To build profiles from multiple alignments, see hmmbuild. I interpreted this as use for multiple sequences. My sequence database is quite large so what you mentioned above could be the issue. It looks like hmmscan will be a better option and it seems like anvi'o also updated to use hmmscan.

What confuses me the most is that I used hmmsearch for the PFAM database and it seemed to work great? What could be different about the TIGRFAM database? They look the same when I open up the merged files.

ADD REPLY • link 4.0 years ago by O.rka ▴ 710

0

Entering edit mode

Not sure what the problem is, other than any HMM search will be slow on a large database when using a single thread. Also, nothing is wrong with TIGRFAMs database either.

It may be a good idea to troubleshoot it on a smaller database as you will get the answer faster. I just ran hmmsearch with a whole TIGRFAMs database against PDB (~110K sequences). The output was as expected, with all HMMs having hits. That was using 3 CPUs and still it took 32 minutes. I am sure you can do the math how long it would take with a single CPU and 100-200x larger database.