I have a dataset of more than 3000 protein sequences in FASTA format for which I need to get alignments with hhblits. Is there any way I can submit all the sequences at once rather than submitting them one by one in hhblits?
You have a choice to make.
- hhblits allows you to specify the number of threads used for HMM searching for each individual query. This means that a single query will finish faster.
- You can run each query with a single thread, but launch as many fasta jobs as your infrastructure will tolerate (up to your CPU count essentially).
Which of these will be faster, I have no idea.
Alternatively, let's say you had 30 cores: you could launch 6 fasta jobs at a time, each with 5 cores for the actual HMM searching.
To launch multiple jobs, look at some guides for GNU parallel.
Again, which of these scenarios will be fastest, I don’t know. If you opt for the last approach, combining parallel with the process having multiple threads, make sure the maths adds up to your total number of cores or less, else you’ll risk CPU thrashing.
After having a little chat with some colleagues and one of the developers of the tool, the general rule of thumb (not just for HHsuite) is that launching n processes with 1 thread each will be faster than launching 1 process with n threads (with some exceptions for disk IO and the like). There are also MPI binaries for parallel processing, which minimise the RAM requirements by sharing data between processes.
I.e., if you have 300 sequences and 32 cores, say, you're best to launch 32 x 1-thread processes (one for each fasta), which you could do with parallel as I mentioned before. As each of the 32 finishes, the next can be started (parallel takes care of this).
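One caveat: all of the above assumes one FASTA file per sequence, since hhblits takes a single query at a time. If your 3000+ sequences sit in one multi-FASTA file, a small awk sketch (the `all.fasta` name and `seq_N.fas` output scheme are just illustrative) can split it first:

```shell
# Write each '>'-delimited record of all.fasta to its own file:
# seq_1.fas, seq_2.fas, ...
awk '/^>/{n++; f=sprintf("seq_%d.fas", n)} {print > f}' all.fasta
```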
It is not clear whether you want to submit your sequences to the online server or to run them locally. For online submissions, I think you will need to submit them one at a time. The rest of this answer is about local runs.
There is a nice perl script called multithread.pl that comes with HHsuite. If you have defined $HHLIB properly as explained in the instructions, you should be able to see its usage by typing the command name. Assuming you have 8 threads, it could go something like this:
multithread.pl '*.fas' 'hhblits -i $file -o $base.hhr -d /database/location/db_name -v 0 -cpu 2' --cpu 4
The first pair of single quotes specifies the group of files on which you want to run the command, and the command itself goes in the second pair. The variable $file will contain the individual file names, while $base refers to those same names with their extension (.fas in this case) truncated. In this case I am asking hhblits to use 2 CPUs, and --cpu 4 at the end means that 4 simultaneous processes may be run (4*2 is the maximum number of threads I assumed to be available). The script will monitor the processes and start a new one as soon as one of the previous jobs has finished.
A couple of things to note: hhblits is bound by memory perhaps even more than by the number of threads. So even if you have 32 CPUs and could run 8 parallel jobs with 4 threads each, you may not be able to fit all of that in memory. I have access to a computer with 40 CPUs and 256 GB of memory, yet I don't run more than 4 parallel hhblits jobs with 8 CPUs each, at least not with the most recent hhblits databases. Also, it is not by accident that -cpu 2 is the hhblits default: in my experience, 2 CPUs is the optimal number in terms of efficiency.