Tool: GNU Parallel - Parallelize Serial Command Line Programs Without Changing Them
8.2 years ago
ole.tange ★ 4.0k

Article describing tool (for citations):

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

Author's website for obtaining code:

http://www.gnu.org/software/parallel/

All new computers have multiple cores. Many bioinformatics tools are serial in nature and will therefore not use the multiple cores. However, many bioinformatics tasks (especially within NGS) are extremely parallelizable:

• Run the same program on many files
• Run the same program on every sequence

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

Installation

A personal installation does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash


For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

EXAMPLE: Replace a for-loop

It is often faster to write a command using GNU Parallel than making a for loop:

for i in *gz; do
  zcat $i > $(basename $i .gz).unpacked
done

can be written as:

parallel 'zcat {} > {.}.unpacked' ::: *.gz

The added benefit is that the zcats are run in parallel - one per CPU core.

EXAMPLE: Parallelizing BLAT

This will start one blat process per CPU core and distribute foo.fa to them in 1 MB blocks:

cat foo.fa | parallel --round-robin --pipe --recstart '>' 'blat -noHead genome.fa stdin >(cat) >&2' > foo.psl

EXAMPLE: Blast on multiple machines

Assume you have a 1 GB FASTA file that you want to blast. GNU Parallel can split the FASTA file into 100 KB chunks and run one job per CPU core:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results

If you have access to the local machine, server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores each server has:

cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result

EXAMPLE: Run bigWigToWig for each chromosome

If you have one file per chromosome, it is easy to parallelize processing each file. Here we run bigWigToWig for chromosomes 1..19 + X Y M. These will run in parallel, but only one job per CPU core. The {} will be substituted with the arguments following the separator ':::'.

parallel bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M

EXAMPLE: Running composed commands

GNU Parallel is not limited to running a single command. It can run a composed command. Here is how you process multiple FASTA files using Biopieces (which uses pipes to communicate):

parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna

EXAMPLE: Running experiments

Experiments often have several parameters where every combination should be tested.
Assume we have a program called experiment that takes 3 arguments: --age, --sex and --chr:

experiment --age 18 --sex M --chr 22

Now we want to run experiment for every combination of ages 1..80, sex M/F and chr 1..22+XY:

parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

To save the output in different files you could do:

parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y

But GNU Parallel can structure the output into directories, so you avoid having thousands of output files in a single dir:

parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.

If you have many different parameters it may be handy to name them:

parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

Then the output files will be named like outputdir/AGE/80/CHR/Y/SEX/F/stdout

If you want the output in a CSV/TSV file that you can read into R or LibreOffice Calc, simply point --results to a file ending in .csv/.tsv:

parallel --results output.tsv --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

It will deal correctly with newlines in the output, so they will be read as newlines in R or LibreOffice Calc.

If one of your parameters takes on many different values, these can be read from a file using '::::':

echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y

If you have many experiments, it can be useful to see some experiments picked at random. Think of it as painting a picture by numbers: you can start from the top corner, or you can paint bits at random.
If you paint bits at random, you will often see a pattern earlier than if you painted in a structured way. With --shuf GNU Parallel will shuffle the experiments and run them all, but in random order:

parallel --shuf --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y

EXAMPLE (advanced): Using GNU Parallel to parallelize your own scripts

Assume you have a BASH/Perl/Python script called launch. It takes one argument, ID:

launch ID

Using parallel you can run multiple IDs in parallel:

parallel launch ::: ID1 ID2 ...

But you would like to hide this complexity from the user, so the user only has to do:

launch ID1 ID2 ...

You can do that using --shebang-wrap. Change the shebang line from:

#!/usr/bin/env bash
#!/usr/bin/env perl
#!/usr/bin/env python

to:

#!/usr/bin/parallel --shebang-wrap bash
#!/usr/bin/parallel --shebang-wrap perl
#!/usr/bin/parallel --shebang-wrap python

You further develop your script so it now takes an ID and a DIR:

launch ID DIR

You would like it to take multiple IDs but only one DIR, and run the IDs in parallel.
Again just change the shebang line to:

#!/usr/bin/parallel --shebang-wrap bash

And now you can run:

launch ID1 ID2 ID3 ::: DIR

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial once a year - your command line will love you for it: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel #ilovefs

If you like GNU Parallel:

• Give a demo at your local user group/team/colleagues (remember to show them --bibtex)
• Post the intro videos on Reddit/Diaspora*/forums/blogs/Identi.ca/Google+/Twitter/Facebook/LinkedIn/mailing lists
• Get the merchandise https://www.gnu.org/s/parallel/merchandise.html
• Request or write a review for your favourite blog or magazine
• Request or build a package for your favourite distribution (if it is not already there)
• Invite me for your next conference

When using programs that use GNU Parallel to process data for publication you should cite as per parallel --citation. If you prefer not to cite, contact me.

If GNU Parallel saves you money:

Excellent examples. I've been using GNU Parallel for a while, but I learned a lot by reading this. Thanks for posting, and for the videos (those really helped me get off the ground with Parallel).

This is very very useful. Thanks for the concrete examples. BTW, about zcat: a multithreaded version of gzip exists, it is called "pigz" ;-)

pigz (http://zlib.net/pigz/) is simple and can save a very significant amount of time if you have a lot of threads available.

First, GNU parallel has the best installability of everything I have seen in my life. And all the examples worked - gunzips, blast and blasts etc.
However, I got stuck on one thing. I have a simple Perl script called transparal.pl. It does something to a file that is provided as an argument. The original one behaves as expected after chmod 700:

$ ./transparal_old.pl
Could not open file! at ./transparal_old.pl line 5.

(I did not give the input file name!)


Then I changed the shebang to

#!/usr/bin/parallel --shebang-wrap perl


and ...

$ ./transparal.pl
-bash: ./transparal.pl: /usr/bin/parallel: bad interpreter: No such file or directory

Checking GNU parallel:

$ which parallel
/usr/local/bin/parallel


Looks ok. Confused.


Look at #!/usr/bin/parallel vs. where it really is: /usr/local/bin/parallel. You could also try #!/usr/bin/env parallel instead; then it will just take parallel from your PATH, which is possibly the (mostly) portable way. However, I am not sure the /usr/bin/env way handles parameters to the program. Edit: see http://stackoverflow.com/questions/4303128/how-to-use-multiple-arguments-with-a-shebang-i-e which in conclusion means that on many systems you can only use the correct absolute path (and #!/usr/bin/env parallel --args-to-the-program most likely will not work).
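A quick way to see such a mismatch (the toy script below is created just for illustration):

```shell
cd "$(mktemp -d)"
printf '#!/usr/bin/parallel --shebang-wrap perl\n' > t.pl   # toy script
head -n 1 t.pl                       # the interpreter path the kernel will exec
command -v parallel || echo 'parallel not on PATH'   # where parallel actually is
```

If the two paths differ, you get the "bad interpreter" error above.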


I wonder if recent builds of the binary expect it to be in /usr/bin. Still trying to troubleshoot a similar problem.


I have built my own parallel and installed it in $HOME/bin, which is in my PATH; it worked fine for me.

Great. On a cluster, does one need to acquire/assign the cores to be used via MPI/SMP first, or can one just run parallel without that?

That would depend on the rules for your cluster. The default for GNU Parallel is to spawn one process per CPU core.

It's Rocks running Java SGE, I will test and see. Cheers

@ole.tange maybe you could briefly explain why parallel is superior to a for loop - aside from the shorter syntax.

The for loop executes the commands one at a time. Parallel can use multiple processors to run them in parallel.

A for loop can start many jobs simultaneously by putting them in the background -> for i in *; do cat $i & done; - but that way you may start 1000 jobs, which is probably inefficient. Parallel does some clever load balancing.
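The backgrounding pattern from the comment above, as a runnable toy (file names invented for the demo):

```shell
cd "$(mktemp -d)"
printf 'a\n' > one.txt
printf 'a\nb\n' > two.txt
# Naive parallelism: every job is forked at once; with thousands of files
# this starts thousands of processes and does no load balancing.
for i in *.txt; do wc -l < "$i" & done
wait   # block until all background jobs have finished
```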


What about dealing with the many bioinformatics tools that do not accept streams as input and insist on reading files instead (e.g. blat, I think)? Is there an easy way to autogenerate and delete such files within a single line of "parallel"?


You can use named pipes to stream data to placeholder files, which can be used with some tools that do not read streams: http://en.wikipedia.org/wiki/Named_pipe
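A self-contained sketch of the trick (the toy gzipped FASTA and file names are invented for the demo):

```shell
cd "$(mktemp -d)"
printf '>s1\nACGT\n' | gzip > seqs.fa.gz   # toy input
mkfifo seqs.fa                             # named pipe posing as a plain file
zcat seqs.fa.gz > seqs.fa &                # writer streams into the pipe
wc -l < seqs.fa                            # reader opens the "file", sees the stream
rm seqs.fa
```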


Wow - very cool!


amazing, upvote


In these cases, what I do is to write a wrapper script, which generates any parameter file needed for running the script.


One solution is to create a file:

cat file | parallel --pipe "cat >{#}; my_program {#}; rm {#}"

Alex suggests using named pipes - which is more efficient, but does not work with every tool:

cat file | parallel --pipe "mkfifo {#}; my_program {#} & cat >{#}; rm {#}"


Hey Ole, how do I bypass awk quotes? Example (counting the reads in fastq files):

parallel 'echo && gunzip -c | wc -l | awk \'{print $1/4}\'' ::: *fastq.gz

won't work.

Hi Sukhdeep, the following worked for me:

parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.gz


Super Alex, it works :)


Use --fifo:

cat file | parallel --fifo --pipe wc {}


Or --cat:

cat file | parallel --cat --pipe wc {}


How do we put a wait command between different parallel runs? I have a script that performs multiple jobs in order.

parallel jobs 1
wait
parallel jobs 2
...etc


Will that work?


why do you need to call wait ?


I am thinking that everything starts in parallel. I have to wait until jobs 1 finishes and then start jobs 2.


use GNU make with option -j


Maybe explore GNU Parallel's semaphore options: http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-mutex-and-counting-semaphore

(Though a make-based process as Pierre suggests, or dedicated job scheduler is probably going to be easier to maintain.)


It will work, but there is really no need to call wait. GNU Parallel does that automatically. Try:

parallel 'sleep {};echo Jobslot {%} slept {} seconds' ::: 4 3 2 1
seq 5 -.1 0 | parallel 'sleep {};echo Jobslot {%} slept {} seconds'
seq 5 -.1 0 | parallel -j0 'sleep {};echo Jobslot {%} slept {} seconds'


You will see that GNU Parallel only finishes after the last job is done.


What is the best way to deal with the error below?

parallel "do something" ::: seq.*
-bash: /usr/local/bin/parallel: Argument list too long


Alternatively, if somebody could show me how to pipe to a command defined in a bash script, that would be just wonderful. Right now, I'm doing:

#split the multifasta into individual seqs
cat $NAME/file.fna | parallel --recstart '>' -N1 --pipe "cat - >$NAME/seq.{#}"
#do stuff with the split files
export -f blastFunction
parallel blastFunction ::: $NAME/seq.*

and blastFunction begins like this:

blastFunction() {
  BLAST=$(blastn -query $1 -subject $1 -outfmt 6 -perc_identity 100)


Try giving a text file which lists the input files to parallel instead of direct arguments. You can do this via 4 colons (::::)

Get a list of files by:

ls *.txt > myFiles


Then do:

parallel "do something" :::: myFiles


Thanks for the suggestion, but

-bash: /bin/ls: Argument list too long


yes, if the argument list is too long for bash, it won't work with any command, where you let the shell glob the file names.


Example "EXAMPLE: convert all PED files in a directory to BED" should work for this using find, you seem to have too many txt files


Thanks, this one worked

find $NAME/ -type f -maxdepth 1 -iname "seq.*" | parallel blastFunction

Of course the optimal solution would be if I could skip the part of creating thousands and thousands of files.

you might try the >> operator next time you create files

Use Bash's builtin printf. Because it is a builtin it is not subject to the same limit:

printf "%s\0" seq.* | parallel -0 do something

Hi All, I need some help running freebayes on my two samples using parallel. I saw this in a previous post but got confused:

parallel --keep-order --max-procs 11 "freebayes --fasta-reference $REF \
    --genotype-qualities --experimental-gls \
    --region {} $BAM" ::: $seqnames \
    | vt normalize -r $REF - > $VCF


I am a bit confused: I have two BAM files that need to be run with freebayes. In one run I don't want to use vcffirstheader and vt normalize, and in the second run I do.

Let's say the files are S1.bam and S2.bam and the reference is hg38.fa. Also, do I need

--region {} $BAM

? Without vcffirstheader and vt normalize, would it be:

parallel --keep-order --max-procs 0 "freebayes --fasta-reference hg38.fa" ::: S1.bam S2.bam > output_1.vcf

With vcffirstheader and vt normalize, would it look like:

parallel --keep-order --max-procs 11 "freebayes --fasta-reference hg38.fa" ::: S1.bam S2.bam | vcffirstheader | vt normalize -r hg38.fa - > output_2.vcf

Can someone please help me? Thanks

Hi, I have a burning question. I want to run a script named "predict_binding.py". Its syntax is:

./predict_binding.py [argA] [argB] [argC] ./file.txt

file.txt has a column of strings of the same length:

string_1
string_2
string_3
...
string_n

predict_binding.py works with the first 3 arguments and string_1, then with the 3 arguments and string_2, and so on. That's fine, but now I have m values of argB, and I want to test all of them. I want to use the cluster for this, and this looks like a perfect job for parallel, isn't it? After reading the manual and spending hours trying to make it work, I realised I need some help.

What works so far (and is trivial) is:

parallel --verbose ./predict_binding ::: argA ::: argBi ::: argC ::: ./file.txt

This gives the same result as:

./predict_binding.py argA argBi argC ./file.txt

And indeed the flag --verbose says that the command looks like: ./predict_binding.py argA argBi argC ./file.txt

But I want to test all argB values, so I made a file called args.txt, which looks like this:

argA argB1 argC ./file.txt
argA argB2 argC ./file.txt
...
argA argBm argC ./file.txt

If I do:

cat args.txt | parallel --verbose ./predict_binding.py {}

I get an error from ./predict_binding saying:

predict_binding.py: error: incorrect number of arguments

And --verbose says that the command looks like: ./predict_binding.py argA\ argBi\ argC\ ./file.txt

So maybe those backslashes are affecting the input of ./predict_binding? How could I avoid them? I have tried double and single quotes " ', backslash \, and backslash with single quote \'; none has worked!
I also tried:

cat ./args.txt | parallel --verbose echo | ./predict_binding

Same error as above. And I also tried to use a function like:

binding_func() { ./predict_binding argA $1 argC ./file.txt; }


Interestingly, binding_func works for:

parallel binding_func ::: argB1


But if I do:

parallel binding_func ::: argB1 argB2


It gives the result for one arg but fails (same error as above) for the other.

If I put only argB1 in the args.txt file and do:

cat args.txt | parallel --verbose binding_func {}


It fails miserably with the same error: predict_binding.py: error: incorrect number of arguments

It seems a very trivial and easy problem but I haven't been able to solve it }:(

I would appreciate very much any help provided. :)


parallel ./predict_binding.py argA argB argC :::: ./file.txt


This is just wonderful. As you have mentioned above, GNU Parallel can parallelize your own scripts (bash/python/perl etc.), which can take multiple IDs (i.e., arguments) in a single go. Does it work the other way too, taking a single argument and running it on multiple cores of the computer?


How would you run a single argument on multiple cores?


What options are available if you want to utilize other machines but they require a password for ssh? Is there a way to force using rsh?


Set up RSA keys for password-less SSH.
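For example (the key path is just an illustration; in practice you would use the default ~/.ssh/id_rsa and your real user@host):

```shell
keyfile="$(mktemp -d)/id_rsa"               # demo location for the key pair
ssh-keygen -q -t rsa -N '' -f "$keyfile"    # passphrase-less key pair
ls "$keyfile" "$keyfile.pub"
# Then append the public key to ~/.ssh/authorized_keys on each remote host,
# e.g.: ssh-copy-id -i "$keyfile" user@server1
```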


Thank you so much!!! This turned a 5.5 hour blast+ job into 25 minutes!


Hello ole.tange

In the case of blast, I was wondering what the difference is between using -num_threads and using parallel: when I use parallel and run top, it shows that all processes are blast with cpu% at 99-100, while with -num_threads it shows only one blast process but its cpu% is 5900. (I have 60 cores on the server.)

I am having trouble comprehending the two ideas!


You should use whichever works faster for you.


I am creating a single .fastq.gz file from many .fastq.gz files with the following command

zcat 15_S15*.fastq.gz | gzip -c > combined_file.fastq.gz

Now, I want to do it with the parallel command.

Can anyone help me?


Furthermore: you don't need zcat | gzip; see How To Merge Two Fastq.Gz Files?


7.5 years ago

I put my notebook about GNU parallel on figshare:

My document follows Ole Tange’s GNU parallel tutorial ( http://www.gnu.org/software/parallel/parallel_tutorial.html ) but I tried to use some bioinformatics-related examples (align with BWA, Samtools, etc.. ).


Thanks so much for sharing this, it's really useful. I notice that you don't include an example for sorting bam files using parallel. I'm trying that now:

sort_index_bam(){
  outfile=`echo $1 | sed -e 's/.bam/_sorted/g'`
  samtools sort $1 $outfile
  index_file="$outfile.bam"
  samtools index $index_file
}
export -f sort_index_bam
parallel -j10 --progress --xapply sort_index_bam ::: `ls -1 *.bam`

And I get the error (for example):

[E::hts_open] fail to open file 'HiC_LY1_1_NoIndex_L003_034_sorted.bam'
local:10/33/100%/42.8s
[bam_sort_core] merging from 3 files...

Perhaps it's something to do with how parallel schedules its worker threads? The same scripted commands work fine on the command line in serial. I'm wondering if you have tried something similar.

Try this (requires parallel 20140822 or later):

sort_index_bam(){
  samtools sort "$1" "$2"
  samtools index "$2"
}
export -f sort_index_bam
parallel --bar sort_index_bam {} '{=s/.bam/_sorted.bam/=}' ::: *.bam


This is awesome!


Thank you Pierre! Do you think it would be possible to have a version with bigger fonts? At 125%, it is difficult to read.

8.2 years ago
lh3 32k

All the clusters I use require SGE/LSF. My understanding is that parallel does not support SGE/LSF (correct me if I am wrong). I would recommend a general way to compose multiple command lines, such as:

seq 22 | xargs -i echo samtools cmd -r chr{} aln.bam | parallel -j 5
ls *.bed | sed s,.bed,, | xargs -i echo mv {}.bed {}.gff | sh
ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo psmc {}.psmcfa \> {}.psmc \& | sh
ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo bsub 'psmc -o {}.psmc {}.psmcfa' | sh


For the last command line to submit jobs to LSF, I more often use my asub script:

ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo psmc -o {}.psmc {}.psmcfa | asub


You can see from the above examples that the general pattern is to feed a list to xargs -i echo to let it print the commands to stdout. The last command after pipe | can be sh if you know they run fast, parallel if you want to control the number of concurrent jobs on the same machine, or asub if you want to submit to LSF/SGE. There are also a few variants: e.g. put & or bsub in the echo command. With xargs -i echo, you will not be bound to the parallel grammar. Another advantage is for complex command lines, you can pipe it to more to visually check if the commands are correct. At least for me, I frequently see problems from more before submitting thousands of jobs.


Here are your examples using GNU Parallel:

seq 22 | parallel -j5 samtools cmd -r chr{} aln.bam
ls *.bed | parallel mv {} {.}.gff
ls *.psmcfa | parallel psmc {} \> {.}.psmc
ls *.psmcfa | parallel bsub psmc -o {.}.psmc {}
ls *.psmcfa | parallel echo psmc -o {.}.psmc {} | asub


It is shorter and IMHO easier to read.

You can use --dry-run if you want to see what would be done.

ls *.psmcfa | parallel --dry-run psmc {} \> {.}.psmc


I have known the basics of parallel for some time. My problem is that it is not a standard tool: none of the machines I use across multiple institutes have it, while sed/awk/perl/xargs exist in every Unix distribution. My point is to learn to construct command lines in general. There may be more complicated cases you cannot handle with one parallel invocation.


As long as you are allowed to run your own scripts you can run GNU Parallel. 10 seconds installation: 'wget -O - pi.dk/3 | bash'. Read Minimal installation in http://git.savannah.gnu.org/cgit/parallel.git/tree/README

The examples you provide deal badly with spaces in the filenames. Using GNU Parallel this is no longer an issue.
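For instance, whitespace-split pipelines mangle such names, while NUL-delimited input keeps them intact (toy files created for the demo):

```shell
cd "$(mktemp -d)"
touch "sample 1.bed" "sample2.bed"
# Whitespace-split: "sample 1.bed" is seen as two arguments
ls *.bed | xargs -n1 echo arg:
# NUL-delimited: each file name arrives as exactly one argument
find . -name '*.bed' -print0 | xargs -0 -n1 echo arg:
```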


These are both good points. There are good arguments to make for both cases. On one hand, you don't want to clutter your pipelines with tools and binaries that are not minimally standard. On the other hand, good tools will eventually become a standard (this tool may be an example). Somewhat related, I think this project (shameless plug): https://github.com/drio/bio.brew can help mitigate the management of software tools. It is especially useful when you don't have root access to the boxes where you do your analysis.

8.2 years ago

EXAMPLE: Using multiple SSH-capable hosts to efficiently generate a highly-compressed BED archive

For labs without an SGE installation but lots of quiet hosts running an SSH service and BEDOPS tools, we can use GNU Parallel to quickly generate per-chromosome, highly-compressed archives of an input BED file, stitching them together at the end into one complete Starch archive.

This archival process can reduce a four-column BED file to about 5% of its original size, while preserving the ability to do memory-efficient, high-performance and powerful multi-set and statistical operations with bedops, bedmap, bedextract and closest-features:

$ PARALLEL_HOSTS=foo,bar,baz
$ bedextract --list-chr input.bed \
    | parallel \
      --sshlogin $PARALLEL_HOSTS \
      "bedextract {} input.bed | starch - > input.{}.starch"
$ starchcat input.*.starch > input.starch
$ rm input.*.starch

Once the archive is made, it can be operated on directly, just like a BED file, e.g.:

$ bedops --chrom chrN --element-of -1 input.starch another.bed
...


We have posted a GNU Parallel-based variant of our starchcluster script, part of the BEDOPS toolkit, which uses some of the above code to facilitate using multiple hosts to efficiently parallelize the process of making highly-compressed and operable Starch archives out of BED inputs.


I am trying to figure out why you need --max-lines 1 --keep-order. Would it not work just as well without? Also why not use GNU Parallel's automation to figure out the number of cores instead of forcing $MAX_JOBS in parallel? (and you miss a | before parallel) ADD REPLY 1 Entering edit mode You're right - assuming that defaults do not change, then it isn't necessary to be explicit. Thanks for the pipe catch. ADD REPLY 0 Entering edit mode Is it possible to make parallel use rsh in the case where ssh requires a password? ADD REPLY 2 Entering edit mode Not sure, but you can certainly use RSA keys to SSH across hosts without a password. See: http://archive.news.softpedia.com/news/How-to-Use-RSA-Key-for-SSH-Authentication-38599.shtml ADD REPLY 0 Entering edit mode Yes. Normally a --sshlogin is simply a host name: server1  If you want to use another "ssh-command" than ssh, then you prepend with full path to the command: /usr/bin/rsh server1  Making it look like this: parallel -S "/usr/bin/rsh server1" do_stuff  But if possible it is a much better idea to use SSH with SSH-agent to avoid typing the passwords: https://wiki.dna.ku.dk/dokuwiki/doku.php?id=ssh_config ADD REPLY 3 Entering edit mode 8.2 years ago EXAMPLE: Coalescent simulations using COSI This script can be used to launch multiple cosi simulations using the GNU/parallel tool, or a Sun Grid Engine environment. This is how I launch it: $: seq 1 100 | parallel ./launch_single_cosi_iteration.sh {} outputfolder


This is a home-made script that I wrote to execute a one-time task. I've tried to adapt it for the general case and to improve the documentation, but I didn't spend much time on it. Please ask me if it doesn't work or if you have any doubts.

EXAMPLE: convert all PED files in a directory to BED

This is an example of how GNU/parallel can be used in combination with plink (or vcftools) to execute tasks on a set of data files. Note how {.} takes the value of the file name without the extension.

find . -type f -maxdepth 1 -iname "*ped" | parallel "plink --make-bed --noweb --file {.} --out {.}"


Since you do not use UNIX special chars (such as | * > &) in your command the " are not needed.


thank you, I didn't know that :-)

6.7 years ago
brentp 23k

Practical variant calling example

get sequence names from FASTA to parallelize by chromosome--not perfect, but works well in practice:

seqnames=$(grep ">" $REF | awk '{ print substr($1, 2, length($1)) }')


run samtools

parallel --keep-order --max-procs 11 "samtools mpileup -Euf $REF -r {} $BAM \
    | bcftools view -v -" ::: $seqnames \
    | vcffirstheader \
    | vt normalize -r $REF - > $VCF

where vcffirstheader is from vcflib and vt normalize is from https://github.com/atks/vt

Same for freebayes:

parallel --keep-order --max-procs 11 "freebayes --fasta-reference $REF \
    --genotype-qualities --experimental-gls \
    --region {} $BAM" ::: $seqnames \
    | vt normalize -r $REF - > $VCF


When your parallel command spans multiple screen lines it is time to consider using a bash function instead:

my_freebayes() {
  freebayes --fasta-reference $REF --genotype-qualities --experimental-gls --region "$1" $BAM
}
export -f my_freebayes
parallel --keep-order --max-procs 11 my_freebayes ::: $seqnames \
    | vt normalize -r $REF - > $VCF


But it is purely a matter of taste.

2.1 years ago
ole.tange ★ 4.0k

GNU Parallel now has a cheat sheet:

https://www.gnu.org/software/parallel/parallel_cheat.pdf

4 months ago
ole.tange ★ 4.0k

EXAMPLE: grouping of lines

GNU Parallel > 20190522 can split piped input into chunks based on the value of a given field.

You have input as:

sampleID,chr1, ...
sampleID,chr1, ...
:
sampleID,chr1, ...
sampleID,chr2, ...
:
sampleID,chr2, ...
sampleID,chr3, ...


You have a program that reads lines for one chromosome (process_chr), so you want the input to be chopped into chunks based on the value in column 2:

cat file | parallel --group-by 2 --colsep , -N1 --pipe process_chr


If process_chr reads 1 or more chromosomes:

cat file | parallel --group-by 2 --colsep , --pipe process_chr

17 months ago
ole.tange ★ 4.0k

EXAMPLE: Call program with FASTA sequence

FASTA files have the format:

>Sequence name1
sequence
sequence continued
>Sequence name2
sequence
sequence continued
more sequence


To call myprog with the sequence as argument run:

cat file.fasta |
parallel --pipe -N1 --recstart '>' --rrs \
'read a; echo Name: "$a"; myprog $(tr -d "\n")'