Help speeding up HMMER's HMMSearch algorithm for large fasta file with GNU Parallel
1
0
Entering edit mode
10 weeks ago
O.rka ▴ 390

I've seen that HMMER can be sped up with GNU Parallel: Speed of hmmsearch

I have around 100,000 sequences and a HMMER database of around 300 HMM profiles.

I’m running everything at once but I’m wondering if it’ll be faster to split up the sequences and/or split up the jobs.

There’s a few ways to do it, if I'm working with 16 threads:

  1. Should I split up the sequences equal to the number of threads and run in parallel with 1 thread each? For example, split sequences into 16 files and then running 16 jobs at 1 thread?

  2. Should I split up into around 1000 files and then run HMMSearch for each of them one at a time but at full 16 threads?

  3. Should I split up into 1000 files, then run 16 of them at a time, each with 1 thread?

What’s the best way to do it?

Are there any wrapper that do this for you?

If I wanted to do option 1, how would I do this?

For simplicity, instead of 16 threads, let's just use 4 threads.

I have my large fasta file (in reality it's a lot of different fasta files of different size that I cat together),

# The pipeline I'm working on uses seqkit as the fasta parsing and manipulation package
cat *.faa | seqkit split -p 4 -O proteins/

In proteins/ there will be 4 fasta files:

stdin.part_001.fasta, stdin.part_002.fasta, stdin.part_003.fasta, stdin.part_004.fasta

Here's my marker databases:

DB_1=MarkerSet1/genes.hmm.gz
DB_2=MarkerSet2/genes.hmm.gz
DB_3=MarkerSet3/genes.hmm.gz

How would I run hmmsearch with GNU Parallel using each of these 4 fasta files that I split up?

I'm trying to adapt this code here from a previous BioStars post to my situation with the multiple files:

function hmmer() {
    n=$(basename "$1")
    hmmsearch -Z 2000000 --domZ 89 --cpu 1 -o $1.output.hmmsearch.txt stockholm89.hmm $1
}

export -f hmmer
find /where/the/split/files/are/ -maxdepth 1 -type f -name "*specific2splitFiles" | parallel -j 20 hmmer {}

I'm expecting an output like this but I don't understand how to adapt the parallel command:

stdin.part_001.MarkerSet1.tblout
stdin.part_002.MarkerSet1.tblout
stdin.part_003.MarkerSet1.tblout
stdin.part_004.MarkerSet1.tblout
stdin.part_001.MarkerSet2.tblout
stdin.part_002.MarkerSet2.tblout
stdin.part_003.MarkerSet2.tblout
stdin.part_004.MarkerSet2.tblout
stdin.part_001.MarkerSet3.tblout
stdin.part_002.MarkerSet3.tblout
stdin.part_003.MarkerSet3.tblout
stdin.part_004.MarkerSet3.tblout
protein parallel hmmer hmm bash • 204 views
ADD COMMENT
1
Entering edit mode
10 weeks ago
Mensur Dlakic ★ 14k

I haven't tried this exact exercise, but from previous experience I don't think options 1 & 3 will work well. Both of them would have 16 independent search processes running, which also means 16 read/write processes. Unless you have an ultra-fast disk - and even if you do - that will most likely be the choking point.

Option 2 would run only one read/write process at a time so there won't be a problem like above. But why split everything into 1000 chunks? hmmsearch is multithreaded, so simply use the maximum --cpu and run everything at once - all HMMs and all your sequences with a single command.

As always, you can take a small subset of your data, say 5-10,000 sequences, and try these options with your architecture.

ADD COMMENT

Login before adding your answer.

Traffic: 2208 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6