Tool: GNU Parallel - Parallelize Serial Command Line Programs Without Changing Them
ole.tange wrote:

Article describing tool (for citations):

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

Authors' website for obtaining code:

http://www.gnu.org/software/parallel/

All new computers have multiple cores. Many bioinformatics tools are serial in nature and will therefore not use the multiple cores. However, many bioinformatics tasks (especially within NGS) are extremely parallelizable:

  • Run the same program on many files
  • Run the same program on every sequence

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[Image: Simple scheduling]

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

[Image: GNU Parallel scheduling]

Installation

A personal installation does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
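
To check that the installation worked, print the version and run a trivial job (each argument after ::: becomes one job):

parallel --version | head -1
parallel echo ::: A B C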

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

EXAMPLE: Replace a for-loop

It is often faster to write a command using GNU Parallel than making a for loop:

for i in *.gz; do
  zcat "$i" > "$(basename "$i" .gz)".unpacked
done

can be written as:

parallel 'zcat {} > {.}.unpacked' ::: *.gz

The added benefit is that the zcats are run in parallel - one per CPU core.
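
If you want to cap the number of simultaneous jobs instead of using one per core, the -j option does that; for example, at most 4 unpackers at a time:

parallel -j4 'zcat {} > {.}.unpacked' ::: *.gz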

EXAMPLE: Parallelizing BLAT

This will start one blat process per CPU core and distribute foo.fa to them in 1 MB blocks:

cat foo.fa | parallel --round-robin --pipe --recstart ">" "blat -noHead genome.fa stdin >(cat) >&2" >foo.psl

EXAMPLE: Blast on multiple machines

Assume you have a 1 GB FASTA file that you want to BLAST. GNU Parallel can split the FASTA file into 100 KB chunks and run one job per CPU core:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results

If you have access to the local machine, server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores are on each of the servers:

cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result
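
If the server list grows, it can be kept in a file and passed with --sshloginfile; ':' again stands for the local machine:

echo : > nodefile
echo server1 >> nodefile
echo server2 >> nodefile
cat 1gb.fasta | parallel --sshloginfile nodefile --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result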

EXAMPLE: Run bigWigToWig for each chromosome

If you have one file per chromosome, it is easy to parallelize processing each file. Here we run bigWigToWig for chromosomes 1..19 + X Y M. These will run in parallel, but only one job per CPU core. The {} will be substituted with the arguments following the separator ':::'.

parallel bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M
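
To preview the exact commands without running them, add --dry-run:

parallel --dry-run bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M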

EXAMPLE: Running composed commands

GNU Parallel is not limited to running a single command. It can run a composed command. Here is how you process multiple FASTA files using Biopieces (which uses pipes to communicate):

parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna

See also: http://code.google.com/p/biopieces/wiki/HowTo#Howto_use_Biopieces_with_GNU_Parallel

EXAMPLE: Running experiments

Experiments often have several parameters where every combination should be tested. Assume we have a program called experiment that takes 3 arguments: --age --sex --chr:

experiment --age 18 --sex M --chr 22

Now we want to run experiment for every combination of ages 1..80, sex M/F, and chr 1..22 + X Y (80 × 2 × 24 = 3840 combinations):

parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

To save the output in different files you could do:

parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y

But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single directory:

parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.
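
Having one stdout file per combination makes post-processing easy; a sketch that collects every result into a single file (the name all_results.txt is just an example):

find outputdir -type f -name stdout -print0 | xargs -0 cat > all_results.txt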

If you have many different parameters it may be handy to name them:

parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

Then the output files will be named like outputdir/AGE/80/CHR/Y/SEX/F/stdout

If one of your parameters takes on many different values, these can be read from a file using '::::':

echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y

EXAMPLE (advanced): Using GNU Parallel to parallelize your own scripts

Assume you have a BASH/Perl/Python script called launch. It takes one argument, an ID:

launch ID

Using parallel you can run multiple IDs in parallel using:

parallel launch ::: ID1 ID2 ...

But you would like to hide this complexity from the user, so the user only has to do:

launch ID1 ID2 ...

You can do that using --shebang-wrap. Change the shebang line from:

#!/usr/bin/env bash
#!/usr/bin/env perl
#!/usr/bin/env python

to:

#!/usr/bin/parallel --shebang-wrap bash
#!/usr/bin/parallel --shebang-wrap perl
#!/usr/bin/parallel --shebang-wrap python
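
A minimal sketch of such a wrapped script (the name launch and its body are hypothetical): each ID given on the command line is run as a separate job, so the body sees one ID at a time in $1:

#!/usr/bin/parallel --shebang-wrap bash
# run once per command-line ID, jobs in parallel; $1 holds the current ID
echo "processing $1"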

You further develop your script so it now takes an ID and a DIR:

launch ID DIR

You would like it to take multiple IDs but only one DIR, and run the IDs in parallel. Again just change the shebang line to:

#!/usr/bin/parallel --shebang-wrap bash

And now you can run:

launch ID1 ID2 ID3 ::: DIR

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

#ilovefs

If you like GNU Parallel:

  • Give a demo at your local user group/team/colleagues
  • Post the intro videos on Reddit/Diaspora*/forums/blogs/Identi.ca/Google+/Twitter/Facebook/LinkedIn/mailing lists
  • Get the merchandise https://www.gnu.org/s/parallel/merchandise.html
  • Request or write a review for your favourite blog or magazine
  • Request or build a package for your favourite distribution (if it is not already there)
  • Invite me for your next conference

If you use GNU Parallel for research:

  • Please cite GNU Parallel in your publications (use --bibtex)

If GNU Parallel saves you money:

  • (Have your company) donate to FSF https://my.fsf.org/donate/

modified 2 days ago by madismetsis • written 14 months ago by ole.tange

Excellent examples. I've been using GNU Parallel for a while, but I learned a lot by reading this. Thanks for posting, and for the videos (those really helped me get off the ground with Parallel).

written 14 months ago by SES

Great. On a cluster, does one need to acquire/assign the cores to be used via MPI/SMP first, or can one just run parallel without that?

written 14 months ago by Sukhdeep Singh

That would depend on the rules for your cluster. The default for GNU Parallel is to spawn one process per CPU core.

written 14 months ago by ole.tange

It's Rocks running Java SGE; I will test and see. Cheers

written 14 months ago by Sukhdeep Singh

@ole.tange maybe you could briefly explain why parallel is superior to a for loop - aside from the shorter syntax.

written 14 months ago by Martin A Hansen

The for loop executes the commands one at a time. Parallel can use multiple processors to run them in parallel.

written 14 months ago by Giovanni M Dall'Olio

A for loop can start many jobs simultaneously by putting them in the background -> for i in *; do cat $i & done; - but that way you may start 1000 jobs, which is probably inefficient. Parallel does some clever load balancing.

written 14 months ago by Martin A Hansen

What about dealing with the many bioinformatics tools that do not accept streams as input and insist on reading files instead (e.g. blat, I think)? Is there an easy way to autogenerate and delete such files within a single line of "parallel"?

written 13 months ago by Yannick Wurm

You can use named pipes to stream data to placeholder files, which can then be used with tools that do not read streams: http://en.wikipedia.org/wiki/Named_pipe

written 13 months ago by Alex Reynolds

Wow - very cool!

written 12 months ago by Yannick Wurm

In these cases, what I do is write a wrapper script that generates any parameter file needed for running the script.

written 13 months ago by Giovanni M Dall'Olio

One solution is to create a file: cat file | parallel --pipe "cat >{#}; my_program {#}; rm {#}". Alex suggests using named pipes - which is more efficient, but does not work with every tool: cat file | parallel --pipe "mkfifo {#}; my_program {#} & cat >{#};rm {#}"
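
Unpacked with comments, the named-pipe variant might read like this ({#} is GNU Parallel's job number, used here as a unique scratch name; my_program stands in for your tool):

cat file | parallel --pipe '
  mkfifo {#}         # create a named pipe per job
  my_program {#} &   # start the tool reading from the pipe
  cat > {#}          # feed this chunk of the stream into the pipe
  rm {#}             # clean up
'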

modified 13 months ago • written 13 months ago by ole.tange

Hey Ole, how do I bypass awk quotes? Example (counting the reads in fastq files):

parallel 'echo && gunzip -c | wc -l | awk \'{print $1/4}\'' ::: *fastq.gz won't work

written 13 months ago by Sukhdeep Singh

Hi Sukhdeep, the following worked for me: parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.gz

modified 13 months ago • written 13 months ago by Alex Reynolds

Super Alex, it works :)

written 13 months ago by Sukhdeep Singh

This is very, very useful. Thanks for the concrete examples. BTW, about zcat: a multithreaded version of gzip exists, it is called "pigz" ;-)

written 12 months ago by Manu Prestat

pigz (http://zlib.net/pigz/) is simple and can save a very significant amount of time if you have a lot of threads available.
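
For example, pigz can stand in for zcat in the unpacking one-liner further up (-d decompresses, -c writes to stdout):

parallel 'pigz -dc {} > {.}.unpacked' ::: *.gz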

written 11 months ago by Eric Normandeau

This is just wonderful. As you mentioned above, GNU Parallel can parallelize your own bash/python/perl scripts so that they take multiple IDs (i.e. arguments) in a single go. Does it work the other way around too, taking a single argument and running it on multiple cores of the computer?

written 20 days ago by tarakaramji

How would you run a single argument on multiple cores?

written 20 days ago by ole.tange
Pierre Lindenbaum wrote:

I put my notebook about GNU parallel on figshare:

http://figshare.com/articles/GNU_parallel_for_Bioinformatics_my_notebook/822138

My document follows Ole Tange's GNU parallel tutorial (http://www.gnu.org/software/parallel/parallel_tutorial.html), but I tried to use some bioinformatics-related examples (align with BWA, Samtools, etc.).

written 6 months ago by Pierre Lindenbaum

This is awesome!

written 6 months ago by Manu Prestat

Thank you Pierre! Do you think it would be possible to have a version with bigger fonts? At 125%, it is difficult to read.

modified 6 months ago • written 6 months ago by Eric Normandeau
Giovanni M Dall'Olio wrote:

EXAMPLE: Coalescent simulations using COSI

This script can be used to launch multiple cosi simulations using GNU Parallel, or a Sun Grid Engine environment.

This is how I launch it:

$ seq 1 100 | parallel ./launch_single_cosi_iteration.sh {} outputfolder

This is a home-made script that I wrote to execute a one-time task. I've tried to adapt it for the general case and to improve the documentation, but I haven't spent much time on it. Please ask me if it doesn't work or if you have any doubts.


EXAMPLE: convert all PED files in a directory to BED

This is an example of how GNU Parallel can be used in combination with plink (or vcftools) to execute tasks on a set of data files. Note how "{.}" takes the value of the file name without the extension.

find . -maxdepth 1 -type f -iname "*ped" | parallel "plink --make-bed --noweb --file {.} --out {.}"
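
If the file names may contain spaces, a NUL-delimited stream keeps them intact (find -print0 paired with parallel -0):

find . -maxdepth 1 -type f -iname "*ped" -print0 | parallel -0 "plink --make-bed --noweb --file {.} --out {.}"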
modified 14 months ago • written 14 months ago by Giovanni M Dall'Olio

Since you do not use UNIX special chars (such as | * > &) in your command, the double quotes are not needed.

written 14 months ago by ole.tange

Thank you, I didn't know that :-)

written 14 months ago by Giovanni M Dall'Olio
lh3 wrote:

All the clusters I use require SGE/LSF. My understanding is that parallel does not support SGE/LSF (correct me if I am wrong). I would recommend a general way to compose multiple command lines:

seq 22 | xargs -i echo samtools cmd -r chr{} aln.bam | parallel -j 5
ls *.bed | sed s,.bed,, | xargs -i echo mv {}.bed {}.gff | sh
ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo psmc {}.psmcfa \> {}.psmc \& | sh
ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo bsub 'psmc -o {}.psmc {}.psmcfa' | sh

For the last command line to submit jobs to LSF, I more often use my asub script:

ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo psmc -o {}.psmc {}.psmcfa | asub

You can see from the above examples that the general pattern is to feed a list to xargs -i echo and let it print the commands to stdout. The final command after the pipe | can be sh if you know the jobs run fast, parallel if you want to control the number of concurrent jobs on the same machine, or asub if you want to submit to LSF/SGE. There are also a few variants, e.g. putting & or bsub in the echo command. With xargs -i echo, you are not bound to the parallel grammar. Another advantage is that for complex command lines you can pipe the output to more to visually check that the commands are correct. At least for me, this frequently reveals problems before I submit thousands of jobs.
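
For instance, the visual check described above is just a matter of making more the final consumer:

ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo psmc -o {}.psmc {}.psmcfa | more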

written 14 months ago by lh3

Here are your examples using GNU Parallel:

seq 22 | parallel -j5 samtools cmd -r chr{} aln.bam
ls *.bed | parallel mv {} {.}.gff
ls *.psmcfa | parallel psmc {} \> {.}.psmc
ls *.psmcfa | parallel bsub psmc -o {.}.psmc {}
ls *.psmcfa | parallel echo psmc -o {.}.psmc {} | asub

It is shorter and IMHO easier to read.

You can use --dry-run if you want to see what would be done.

ls *.psmcfa | parallel --dry-run psmc {} \> {.}.psmc
modified 14 months ago • written 14 months ago by ole.tange

I have known the basics of parallel for some time. My problem is that it is not a standard tool. None of the machines I use across multiple institutes have it. Sed/awk/perl/xargs exist in every Unix distribution. My point is to learn to construct command lines in general. There may be more complicated cases you cannot handle with one parallel invocation.

modified 14 months ago • written 14 months ago by lh3

As long as you are allowed to run your own scripts you can run GNU Parallel. 10 seconds installation: 'wget -O - pi.dk/3 | bash'. Read Minimal installation in http://git.savannah.gnu.org/cgit/parallel.git/tree/README

The examples you provide deal badly with spaces in the filenames. Using GNU Parallel this is no longer an issue.

modified 11 months ago • written 14 months ago by ole.tange

These are both good points, and there are good arguments for both cases. On one hand, you don't want to clutter your pipelines with tools and binaries that are not minimally standard. On the other hand, good tools will eventually become a standard (this tool may be an example). Somewhat related, I think this project (shameless plug): https://github.com/drio/bio.brew can help mitigate the management of software tools. It is especially useful when you don't have root access to the boxes where you do your analysis.

modified 17 days ago • written 18 days ago by Drio
Alex Reynolds wrote:

EXAMPLE: Using multiple SSH-capable hosts to efficiently generate a highly-compressed BED archive

For labs without an SGE installation, but with lots of quiet hosts running an SSH service and the BEDOPS tools, we can use GNU Parallel to quickly generate per-chromosome, highly-compressed archives of an input BED file, stitching them together at the end into one complete Starch archive.

This archival process can reduce a four-column BED file to about 5% of its original size, while preserving the ability to do memory-efficient, high-performance and powerful multi-set and statistical operations with bedops, bedmap, bedextract and closest-features:

$ PARALLEL_HOSTS=foo,bar,baz
$ bedextract --list-chr input.bed \
    | parallel \
        --sshlogin $PARALLEL_HOSTS \
        "bedextract {} input.bed | starch - > input.{}.starch"
$ starchcat input.*.starch > input.starch
$ rm input.*.starch
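
To spot-check the stitched archive, unstarch (also part of BEDOPS) extracts it back to BED on stdout:

$ unstarch input.starch | head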

Once the archive is made, it can be operated on directly, just like a BED file, e.g.:

$ bedops --chrom chrN --element-of -1 input.starch another.bed
...

We have posted a GNU Parallel-based variant of our *starchcluster* script, part of the BEDOPS toolkit, which uses some of the above code to facilitate using multiple hosts to efficiently parallelize the process of making highly-compressed and operable Starch archives out of BED inputs.

modified 5 weeks ago • written 14 months ago by Alex Reynolds

I am trying to figure out why you need --max-lines 1 --keep-order. Would it not work just as well without? Also, why not use GNU Parallel's autodetection of the number of cores instead of forcing $MAX_JOBS? (And you are missing a | before parallel.)

modified 14 months ago • written 14 months ago by ole.tange

You're right: assuming the defaults do not change, it isn't necessary to be explicit. Thanks for the pipe catch.

written 14 months ago by Alex Reynolds
madismetsis wrote:

First, GNU parallel has the best installability of anything I have seen in my life. And all the examples worked - the gunzips, blats and blasts etc. However, I got stuck on one thing. I have a simple perl script called transparal.pl. It does something to a file that is provided as an argument. The original one behaves as expected after chmod 700:

$ ./transparal_old.pl 
Could not open file! at ./transparal_old.pl line 5. (I did not give the input file name!)

Then I changed the shebang to

#!/usr/bin/parallel --shebang-wrap perl

and ...

$ ./transparal.pl 
-bash: ./transparal.pl: /usr/bin/parallel: bad interpreter: No such file or directory

checking GNU parallel:

$ which parallel
/usr/local/bin/parallel

Looks OK. Confused.

modified 2 days ago by Michael Dondrup • written 2 days ago by madismetsis

Look at #!/usr/bin/parallel vs. where it really is: /usr/local/bin/parallel. You could also try #!/usr/bin/env parallel instead; then it will just take parallel from your PATH, which is possibly the most portable way. However, I am not sure the /usr/bin/env way handles parameters to the program. Edit: see http://stackoverflow.com/questions/4303128/how-to-use-multiple-arguments-with-a-shebang-i-e which in conclusion means that on many systems you can only use the correct absolute path (and #!/usr/bin/env parallel with arguments to the program most likely will not work).

modified 2 days ago • written 2 days ago by Michael Dondrup