Help speeding up HMMER's HMMSearch algorithm for large fasta file with GNU Parallel
1
0
Entering edit mode
10 months ago
O.rka ▴ 600

I've seen that HMMER can be sped up with GNU Parallel: Speed of hmmsearch

I have around 100,000 sequences and a HMMER database of around 300 HMM profiles.

I’m running everything at once but I’m wondering if it’ll be faster to split up the sequences and/or split up the jobs.

There’s a few ways to do it, if I'm working with 16 threads:

1. Should I split up the sequences equal to the number of threads and run in parallel with 1 thread each? For example, split sequences into 16 files and then running 16 jobs at 1 thread?

2. Should I split up into around 1000 files and then run HMMSearch for each of them one at a time but at full 16 threads?

3. Should I split up into 1000 files, then run 16 of them at a time, each with 1 thread?

What’s the best way to do it?

Are there any wrapper that do this for you?

If I wanted to do option 1, how would I do this?

I have my large fasta file (in reality it's a lot of different fasta files of different size that I cat together),

# The pipeline I'm working on uses seqkit as the fasta parsing and manipulation package
cat *.faa | seqkit split -p 4 -O proteins/


In proteins/ there will be 4 fasta files:

stdin.part_001.fasta, stdin.part_002.fasta, stdin.part_003.fasta, stdin.part_004.fasta


Here's my marker databases:

DB_1=MarkerSet1/genes.hmm.gz
DB_2=MarkerSet2/genes.hmm.gz
DB_3=MarkerSet3/genes.hmm.gz


How would I run hmmsearch with GNU Parallel using each of these 4 fasta files that I split up?

I'm trying to adapt this code here from a previous BioStars post to my situation with the multiple files:

function hmmer() {
n=$(basename "$1")
hmmsearch -Z 2000000 --domZ 89 --cpu 1 -o $1.output.hmmsearch.txt stockholm89.hmm$1
}

export -f hmmer
find /where/the/split/files/are/ -maxdepth 1 -type f -name "*specific2splitFiles" | parallel -j 20 hmmer {}


I'm expecting an output like this but I don't understand how to adapt the parallel command:

stdin.part_001.MarkerSet1.tblout
stdin.part_002.MarkerSet1.tblout
stdin.part_003.MarkerSet1.tblout
stdin.part_004.MarkerSet1.tblout
stdin.part_001.MarkerSet2.tblout
stdin.part_002.MarkerSet2.tblout
stdin.part_003.MarkerSet2.tblout
stdin.part_004.MarkerSet2.tblout
stdin.part_001.MarkerSet3.tblout
stdin.part_002.MarkerSet3.tblout
stdin.part_003.MarkerSet3.tblout
stdin.part_004.MarkerSet3.tblout

protein parallel hmmer hmm bash • 625 views
1
Entering edit mode
10 months ago
Mensur Dlakic ★ 20k

I haven't tried this exact exercise, but from previous experience I don't think options 1 & 3 will work well. Both of them would have 16 independent search processes running, which also means 16 read/write processes. Unless you have an ultra-fast disk - and even if you do - that will most likely be the choking point.

Option 2 would run only one read/write process at a time so there won't be a problem like above. But why split everything into 1000 chunks? hmmsearch is multithreaded, so simply use the maximum --cpu and run everything at once - all HMMs and all your sequences with a single command.

As always, you can take a small subset of your data, say 5-10,000 sequences, and try these options with your architecture.