Forum: Got Multiple CPUs? Activate Them With GNU Parallel
10.6 years ago
ole.tange ★ 4.4k

All new computers have multiple cores. Many bioinformatics tools are serial in nature and will therefore not use the multiple cores. However, many bioinformatics tasks (especially within NGS) are extremely parallelizable:

  • Run the same program on many files
  • Run the same program on every sequence

GNU Parallel is a general parallelizer that makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

Simple scheduling

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

GNU Parallel scheduling
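
To make the difference between the two strategies concrete, here is a small bash simulation. The job durations are made up, and the example is scaled down to 8 jobs on 2 CPUs instead of 32 jobs on 4. The greedy loop is only a model of "start the next job as soon as a slot is free"; it is not GNU Parallel's actual code.

```bash
#!/usr/bin/env bash
# Hypothetical job durations in seconds (not taken from the post).
durations=(9 8 7 6 1 1 1 1)
slots=2
per_slot=$(( ${#durations[@]} / slots ))

# Simple scheduling: the first per_slot jobs go to slot 0, the next to slot 1.
static_makespan=0
for ((s = 0; s < slots; s++)); do
  total=0
  for ((j = s * per_slot; j < (s + 1) * per_slot; j++)); do
    total=$(( total + durations[j] ))
  done
  if (( total > static_makespan )); then static_makespan=$total; fi
done

# Dynamic scheduling: each job starts on whichever slot becomes free first.
free_at=()
for ((s = 0; s < slots; s++)); do free_at[s]=0; done
for d in "${durations[@]}"; do
  best=0
  for ((s = 1; s < slots; s++)); do
    if (( free_at[s] < free_at[best] )); then best=$s; fi
  done
  free_at[best]=$(( free_at[best] + d ))
done
dynamic_makespan=0
for t in "${free_at[@]}"; do
  if (( t > dynamic_makespan )); then dynamic_makespan=$t; fi
done

echo "static: ${static_makespan}s, dynamic: ${dynamic_makespan}s"
# prints: static: 30s, dynamic: 17s
```

With these durations the fixed split leaves one CPU idle for most of the run, while the dynamic strategy keeps both busy until the end.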

EXAMPLE: Replace a for-loop

It is often faster to write a command using GNU Parallel than to write a for loop:

for i in *.gz; do
  zcat "$i" > "$(basename "$i" .gz).unpacked"
done

can be written as:

parallel 'zcat {} > {.}.unpacked' ::: *.gz

The added benefit is that the zcats are run in parallel - one per CPU core.
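
If you want to try this safely first, the sketch below builds two tiny gzip files in a scratch directory, previews the commands with --dry-run, and then runs them. The file names and contents are made up, and the serial fallback is only there so the snippet also works where GNU Parallel is not installed. It assumes a Linux-style zcat.

```bash
#!/usr/bin/env bash
# Scratch directory with two small, hypothetical gzip files.
workdir=$(mktemp -d)
cd "$workdir"
printf 'ACGT\n' | gzip > a.gz
printf 'TTGA\n' | gzip > b.gz

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # Print the commands that would be run, without running them.
  parallel --dry-run 'zcat {} > {.}.unpacked' ::: *.gz
  # Run them for real, at most 2 jobs at a time.
  parallel -j 2 'zcat {} > {.}.unpacked' ::: *.gz
else
  # Serial fallback: ${f%.gz} strips the extension, like {.} does.
  for f in *.gz; do
    zcat "$f" > "${f%.gz}.unpacked"
  done
fi

cat a.unpacked b.unpacked
```

Here {} is the input file name and {.} is the same name with its extension removed, so a.gz is unpacked into a.unpacked.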

EXAMPLE: Blast on multiple machines

Assume you have a 1 GB fasta file that you want to blast. GNU Parallel can then split the fasta file into 100 KB chunks and run 1 job per CPU core:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results
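
To see how --pipe and --recstart cut the input without needing blast or a large file, here is a rough sketch on a made-up three-record fasta file. As an assumption for the demo, -N1 replaces --block so that each job gets exactly one record, and grep -c stands in for blastp. The snippet does nothing if GNU Parallel is not installed.

```bash
#!/usr/bin/env bash
# Hypothetical three-record FASTA file standing in for 1gb.fasta.
fasta=$(mktemp)
printf '>seq1\nACGT\n>seq2\nTTGA\n>seq3\nGGCC\n' > "$fasta"

total=""
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # Each job reads one chunk on stdin; grep -c '>' counts the FASTA
  # headers in that chunk. -k keeps the output in input order.
  per_chunk=$(parallel --pipe -N1 --recstart '>' -k "grep -c '>'" < "$fasta")
  echo "headers per chunk:" $per_chunk
  # Across all chunks, every header should be counted exactly once.
  total=$(printf '%s\n' $per_chunk | awk '{ sum += $1 } END { print sum }')
  echo "total headers: $total"
fi
```

Because chunks are only cut in front of a '>', no sequence is split between two jobs, which is what makes it safe to hand each chunk to its own blastp process.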

If you have access to the local machine, server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores are on each of the servers:

cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results

EXAMPLE: Running experiments

Experiments often have several parameters where every combination should be tested. Assume we have a program called experiment that takes 3 arguments: --age --sex --chr:

experiment --age 18 --sex M --chr 22

Now we want to run experiment for every combination of ages 1..80, sex M/F, chr 1..22+XY:

parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
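
Before starting that many jobs it can be worth checking how many there will be. The sketch below counts the combinations in plain bash, using nested loops as a model of what the three ::: input sources expand to, and compares the number with --dry-run if GNU Parallel is installed. experiment is still the hypothetical program from above and is never actually run here.

```bash
#!/usr/bin/env bash
ages=({1..80}); sexes=(M F); chrs=({1..22} X Y)

# Number of combinations: 80 ages * 2 sexes * 24 chromosomes.
expected=$(( ${#ages[@]} * ${#sexes[@]} * ${#chrs[@]} ))
echo "expected jobs: $expected"

# The same combinations as nested loops; only the first three are printed.
count=0
for age in "${ages[@]}"; do
  for sex in "${sexes[@]}"; do
    for chr in "${chrs[@]}"; do
      count=$(( count + 1 ))
      if (( count <= 3 )); then
        echo "experiment --age $age --sex $sex --chr $chr"
      fi
    done
  done
done
echo "loop iterations: $count"

planned=""
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # --dry-run prints one command line per job instead of running it.
  planned=$(parallel --dry-run experiment --age {1} --sex {2} --chr {3} \
              ::: {1..80} ::: M F ::: {1..22} X Y | wc -l)
  planned=$(( planned ))
  echo "planned by parallel: $planned"
fi
```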

To save the output in different files you could do:

parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y

But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single dir:

parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.
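
A quick way to look at that layout is to run a harmless command with a handful of values. In this sketch echo stands in for experiment, the value lists are shortened, and the scratch directory is an assumption of the demo. It does nothing if GNU Parallel is not installed.

```bash
#!/usr/bin/env bash
resdir=$(mktemp -d)

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # echo stands in for the hypothetical experiment program.
  parallel --results "$resdir/outputdir" echo age={1} sex={2} chr={3} \
    ::: 18 80 ::: M F ::: 22 X > /dev/null

  # One stdout file per job, nested by input source number and value.
  (cd "$resdir" && find outputdir -name stdout | sort)

  # The output of the job with age 80, sex M, chr X.
  cat "$resdir/outputdir/1/80/2/M/3/X/stdout"
fi
```

Each path alternates between the number of the input source (1, 2, 3) and the value that source had for that job.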

If you have many different parameters it may be handy to name them:

parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

Then the output files will be named like outputdir/AGE/80/CHR/Y/SEX/F/stdout

If one of your parameters takes on many different values, these can be read from a file using '::::':

echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y
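
To check that the header line in the file is picked up as the parameter name, you can preview a shortened version with --dry-run. The three ages, the reduced chromosome list and the temporary file are assumptions made for this sketch; experiment is never run.

```bash
#!/usr/bin/env bash
# Small stand-in for age_file: a header line followed by three values.
age_file=$(mktemp)
{ echo AGE; seq 1 3; } > "$age_file"
lines=$(( $(wc -l < "$age_file") ))
echo "lines in age file: $lines"

n_jobs=""
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # --dry-run prints the commands; 3 ages * 2 sexes * 2 chromosomes.
  n_jobs=$(parallel --dry-run --header : \
             experiment --age {AGE} --sex {SEX} --chr {CHR} \
             :::: "$age_file" ::: SEX M F ::: CHR X Y | wc -l)
  n_jobs=$(( n_jobs ))
  echo "jobs: $n_jobs"
fi
```

The first line of the file plays the same role as the AGE word after ::: in the earlier command, so it is not counted as a value.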

Learn more

See more examples:

Watch the intro videos:

Walk through the tutorial: Your command line will love you for it.

Sign up for the email list to get support:

parallel command-line ngs • 8.6k views

I may have run into a bug with the current version:

$ wget http://.../gnu/parallel/parallel-latest.tar.bz2
$ tar jxvf parallel-latest.tar.bz2
$ cd parallel-20140322
$ ./configure --prefix=/usr/local
$ make
$ sudo make install
$ parallel --help
-bash: /usr/bin/parallel: No such file or directory
$ parallel --version
-bash: /usr/bin/parallel: No such file or directory

This is being done on an Ubuntu 12.04 host, running GCC 4.8.1:

$ uname -a
Linux foo 3.11.0-18-generic #32~precise1-Ubuntu SMP Thu Feb 20 17:52:10 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1
$ g++ --version
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1

Why is bash looking in /usr/bin when you installed in /usr/local/bin? Try:

ls -l /usr/bin/parallel*  /usr/local/bin/parallel*

And try:

set +h
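
One possible explanation, and only a guess for this particular machine, is that bash remembers ("hashes") where it last found a command. If an older parallel in /usr/bin was run earlier in the session and has since been removed, bash may keep trying that path. set +h turns the remembering off; hash -r just forgets what has been remembered so far. The sketch below reproduces the effect with a made-up demo_tool in two scratch directories:

```bash
#!/usr/bin/env bash
# Two scratch directories standing in for /usr/bin and /usr/local/bin.
demo=$(mktemp -d)
mkdir -p "$demo/usr_bin" "$demo/usr_local_bin"
printf '#!/bin/sh\necho old\n' > "$demo/usr_bin/demo_tool"
printf '#!/bin/sh\necho new\n' > "$demo/usr_local_bin/demo_tool"
chmod +x "$demo/usr_bin/demo_tool" "$demo/usr_local_bin/demo_tool"

PATH="$demo/usr_bin:$demo/usr_local_bin:$PATH"
demo_tool                      # runs the usr_bin copy; bash remembers its path
rm "$demo/usr_bin/demo_tool"   # the old copy goes away
demo_tool || true              # bash may still try the remembered, missing path
hash -r                        # forget all remembered command locations
after=$(demo_tool)             # now found through PATH again
echo "$after"                  # prints: new
type -a demo_tool              # list every demo_tool that PATH can see
```

Running type -a parallel in the affected shell shows the same information for the real command.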

I'm not sure why bash is looking in /usr/bin, and I'm also not sure that setting up a symbolic link is the right solution. I'll try compiling an older version some time next week.

