Help speeding up HMMER's HMMSearch algorithm for large fasta file with GNU Parallel
15 months ago by O.rka ▴ 620

I've seen that HMMER can be sped up with GNU Parallel: Speed of hmmsearch

I have around 100,000 sequences and a HMMER database of around 300 HMM profiles.

I’m currently running everything at once, but I’m wondering if it would be faster to split up the sequences and/or the jobs.

There are a few ways to do it if I'm working with 16 threads:

  1. Should I split the sequences into as many files as I have threads and run them in parallel with 1 thread each? For example, split the sequences into 16 files and run 16 jobs at 1 thread each?

  2. Should I split into around 1000 files and run hmmsearch on them one at a time, each with the full 16 threads?

  3. Should I split up into 1000 files, then run 16 of them at a time, each with 1 thread?

What’s the best way to do it?

Are there any wrappers that do this for you?

If I wanted to do option 1, how would I do this?

For simplicity, instead of 16 threads, let's just use 4 threads.

I have my large fasta file (in reality it's a lot of different fasta files of different sizes that I cat together):

# The pipeline I'm working on uses seqkit as the fasta parsing and manipulation package
cat *.faa | seqkit split -p 4 -O proteins/

In proteins/ there will be 4 fasta files:

stdin.part_001.fasta, stdin.part_002.fasta, stdin.part_003.fasta, stdin.part_004.fasta

Here are my marker databases:


How would I run hmmsearch with GNU Parallel using each of these 4 fasta files that I split up?

I'm trying to adapt this code from a previous BioStars post to my situation with multiple files:

function hmmer() {
    n=$(basename "$1")
    hmmsearch -Z 2000000 --domZ 89 --cpu 1 -o "$1.output.hmmsearch.txt" stockholm89.hmm "$1"
}

export -f hmmer
find /where/the/split/files/are/ -maxdepth 1 -type f -name "*specific2splitFiles" | parallel -j 20 hmmer {}
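Here is my best-guess adaptation for the four split files, written as a dry run so the function only prints each hmmsearch command instead of executing it (`markers.hmm` is a placeholder for my actual HMM file, not its real name):

```shell
# Dry-run sketch: each call echoes the hmmsearch command it would run.
# "markers.hmm" is a placeholder for the real HMM database file.
run_hmm() {
    in="$1"
    out="${in%.fasta}.hmmsearch.txt"   # e.g. stdin.part_001.fasta -> stdin.part_001.hmmsearch.txt
    echo hmmsearch --cpu 1 -o "$out" markers.hmm "$in"
}
export -f run_hmm

# For the real run: drop the "echo" above and launch 4 jobs, 1 thread each:
#   parallel -j 4 run_hmm ::: proteins/stdin.part_*.fasta
run_hmm proteins/stdin.part_001.fasta
# prints: hmmsearch --cpu 1 -o proteins/stdin.part_001.hmmsearch.txt markers.hmm proteins/stdin.part_001.fasta
```

Does the `parallel -j 4 ... ::: proteins/stdin.part_*.fasta` form look right for this case?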

I'm expecting an output like this but I don't understand how to adapt the parallel command:

15 months ago by Mensur Dlakic ★ 21k

I haven't tried this exact exercise, but from previous experience I don't think options 1 & 3 will work well. Both of them would have 16 independent search processes running, which also means 16 read/write processes. Unless you have an ultra-fast disk - and even if you do - that will most likely be the choking point.

Option 2 would run only one read/write process at a time, so it avoids that problem. But why split everything into 1000 chunks? hmmsearch is multithreaded, so simply use the maximum --cpu and run everything at once: all HMMs and all your sequences with a single command.

As always, you can take a small subset of your data, say 5,000-10,000 sequences, and try these options on your architecture.
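For concreteness, option 2 collapses to a single command along these lines (filenames here are placeholders for your concatenated proteins and HMM database; shown as a dry run that only prints the command):

```shell
# Option 2 as a single invocation: hmmsearch threads internally across
# 16 CPUs, so there is only one read/write process for the whole job.
# "markers.hmm" and "all_proteins.faa" are placeholder filenames.
cmd="hmmsearch --cpu 16 -o all.hmmsearch.txt markers.hmm all_proteins.faa"
echo "$cmd"   # dry run; run the command directly once your files are in place
```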

