Tool: GNU Parallel - Parallelize Serial Command Line Programs Without Changing Them
ole.tange wrote (3.6 years ago):

Article describing tool (for citations):

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

Authors' website for obtaining code:

http://www.gnu.org/software/parallel/

All new computers have multiple cores. Many bioinformatics tools are serial in nature and will therefore not use the multiple cores. However, many bioinformatics tasks (especially within NGS) are extremely parallelizable:

  • Run the same program on many files
  • Run the same program on every sequence

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[Image: simple scheduling]

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

[Image: GNU Parallel scheduling]
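
The scheduling idea can be sketched in plain bash (an illustration only, assuming nothing but bash and coreutils; GNU Parallel's real scheduler does much more): keep at most MAX jobs running, and start the next job as soon as a slot frees up.

```shell
#!/usr/bin/env bash
MAX=2                       # pretend we have 2 CPU cores
tmp=$(mktemp)
for i in 1 2 3 4; do
    # busy-wait until fewer than MAX background jobs are running
    while [ "$(jobs -r | wc -l)" -ge "$MAX" ]; do
        sleep 0.1
    done
    { sleep 0.2; echo "job $i done" >> "$tmp"; } &
done
wait                        # like parallel, only return when every job has finished
cat "$tmp"
```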

Installation

A personal installation does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

EXAMPLE: Replace a for-loop

It is often faster to write a command using GNU Parallel than to write a for loop:

for i in *.gz; do
  zcat "$i" > "$(basename "$i" .gz)".unpacked
done

can be written as:

parallel 'zcat {} > {.}.unpacked' ::: *.gz

The added benefit is that the zcats are run in parallel - one per CPU core.
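
For readers who want to see what the one-liner expands to, here is a plain-bash sketch of the same job (assuming only gzip and coreutils are installed): {} stands for the input file and {.} for the file name with its last extension removed.

```shell
#!/usr/bin/env bash
# Plain-bash equivalent of: parallel 'zcat {} > {.}.unpacked' ::: *.gz
set -e
dir=$(mktemp -d) && cd "$dir"
printf 'hello\n' > a.txt && gzip a.txt   # sample input: a.txt.gz
for f in *.gz; do
    zcat "$f" > "${f%.gz}.unpacked" &    # {} is "$f"; {.} is "${f%.gz}"
done
wait                                     # parallel also waits for all jobs
cat a.txt.unpacked                       # -> hello
```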

EXAMPLE: Parallelizing BLAT

This will start a blat process for each processor and distribute foo.fa to these in 1 MB blocks:

cat foo.fa | parallel --round-robin --pipe --recstart ">" "blat -noHead genome.fa stdin >(cat) >&2" >foo.psl
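
The --recstart ">" option tells GNU Parallel that a record starts at ">", so a 1 MB block boundary never lands in the middle of a sequence. The record notion itself can be illustrated without blat, using only a toy FASTA and awk:

```shell
# Two records by the "a record starts at >" rule:
nrec=$(printf '>s1\nAC\n>s2\nGT\n' | awk '/^>/ { n++ } END { print n }')
echo "$nrec records"    # -> 2 records
```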

EXAMPLE: Blast on multiple machines

Assume you have a 1 GB fasta file that you want to BLAST. GNU Parallel can then split the fasta file into 100 KB chunks and run one job per CPU core:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results

If you have access to the local machine, server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores are on each of the servers:

cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result

EXAMPLE: Run bigWigToWig for each chromosome

If you have one file per chromosome, it is easy to parallelize processing each file. Here we run bigWigToWig for chromosomes 1..19 plus X, Y and M. These will run in parallel, but only one job per CPU core. The {} will be substituted with the arguments following the separator ':::'.

parallel bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M

EXAMPLE: Running composed commands

GNU Parallel is not limited to running a single command. It can run a composed command. Here is how you process multiple FASTA files using Biopieces (which uses pipes to communicate):

parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna

See also: http://code.google.com/p/biopieces/wiki/HowTo#Howto_use_Biopieces_with_GNU_Parallel

EXAMPLE: Running experiments

Experiments often have several parameters where every combination should be tested. Assume we have a program called experiment that takes 3 arguments: --age, --sex and --chr:

experiment --age 18 --sex M --chr 22

Now we want to run experiment for every combination of ages 1..80, sex M/F, chr 1..22+XY:

parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
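
Each ':::' adds one more input source, and GNU Parallel runs every combination of them - exactly the Cartesian product you would get from nested loops. A shortened plain-bash rendering (ranges trimmed, and 'experiment' only echoed, since it is a placeholder program):

```shell
combos=$(
    for age in 1 2; do             # stands in for {1..80}
        for sex in M F; do         # stands in for M F
            echo experiment --age "$age" --sex "$sex"
        done
    done
)
echo "$combos"    # 2 ages x 2 sexes = 4 command lines
```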

To save the output in different files you could do:

parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y

But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single dir:

parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.

If you have many different parameters it may be handy to name them:

parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

Then the output files will be named like outputdir/AGE/80/SEX/F/CHR/Y/stdout

If one of your parameters takes on many different values, these can be read from a file using '::::':

echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y

EXAMPLE (advanced): Using GNU Parallel to parallelize your own scripts

Assume you have a BASH/Perl/Python script called launch. It takes one argument, ID:

launch ID

Using parallel you can run multiple IDs in parallel using:

parallel launch ::: ID1 ID2 ...

But you would like to hide this complexity from the user, so the user only has to do:

launch ID1 ID2 ...

You can do that using --shebang-wrap. Change the shebang line from:

#!/usr/bin/env bash
#!/usr/bin/env perl
#!/usr/bin/env python

to:

#!/usr/bin/parallel --shebang-wrap bash
#!/usr/bin/parallel --shebang-wrap perl
#!/usr/bin/parallel --shebang-wrap python

You then develop your script further, so it now takes an ID and a DIR:

launch ID DIR

You would like it to take multiple IDs but only one DIR, and run the IDs in parallel. Again just change the shebang line to:

#!/usr/bin/parallel --shebang-wrap bash

And now you can run:

launch ID1 ID2 ID3 ::: DIR

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial once a year - your command line will love you for it: http://www.gnu.org/software/parallel/parallel_tutorial.html

Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

#ilovefs

If you like GNU Parallel:

  • Give a demo at your local user group/team/colleagues (remember to show them --bibtex)
  • Post the intro videos on Reddit/Diaspora*/forums/blogs/Identi.ca/Google+/Twitter/Facebook/LinkedIn/mailing lists
  • Get the merchandise https://www.gnu.org/s/parallel/merchandise.html
  • Request or write a review for your favourite blog or magazine
  • Request or build a package for your favourite distribution (if it is not already there)
  • Invite me to your next conference

When using programs that use GNU Parallel to process data for publication you should cite as per `parallel --bibtex`. If you prefer not to cite, contact me.

If GNU Parallel saves you money:

written 3.6 years ago by ole.tange

Excellent examples. I've been using GNU Parallel for a while, but I learned a lot by reading this. Thanks for posting, and for the videos (those really helped me get off the ground with Parallel).

written 3.6 years ago by SES

This is very very useful. Thanks for the concrete examples. BTW, about zcat, a multithreaded version of gzip exists, it is called "pigz" ;-)

written 3.5 years ago by Manu Prestat

pigz (http://zlib.net/pigz/) is simple and can save a very significant amount of time if you have a lot of threads available.

written 3.4 years ago by Eric Normandeau

Great. On a cluster, does one need to acquire/assign the cores first using MPI/SMP, or can one just run parallel without that?

written 3.6 years ago by Sukhdeep Singh

That would depend on the rules for your cluster. The default for GNU Parallel is to spawn one process per CPU core.

written 3.6 years ago by ole.tange

It's Rocks running Java SGE; I will test and see. Cheers

written 3.6 years ago by Sukhdeep Singh

@ole.tange maybe you could briefly explain why parallel is superior to a for loop - aside from the shorter syntax.

written 3.6 years ago by Martin A Hansen

The for loop executes the commands one at a time. Parallel can use multiple processors to run them in parallel.

written 3.6 years ago by Giovanni M Dall'Olio

A for loop can start many jobs simultaneously by putting them in the background (for i in *; do cat $i & done) - but that way you may start 1000 jobs at once, which is probably inefficient. Parallel does some clever load balancing.

written 3.6 years ago by Martin A Hansen

What about dealing with the many bioinformatics tools that do not accept streams as input and insist on reading files instead (e.g. blat, I think)? Is there an easy way to autogenerate and delete such files within a single line of "parallel"?

written 3.5 years ago by Yannick Wurm

You can use named pipes to stream data to placeholder files, which can be used with some tools that do not read streams: http://en.wikipedia.org/wiki/Named_pipe
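
A minimal, runnable sketch of the named-pipe trick (assumes only bash and coreutils; wc stands in for a tool that insists on a file name):

```shell
#!/usr/bin/env bash
set -e
dir=$(mktemp -d)
mkfifo "$dir/fake.bed"                    # looks like a file, acts like a pipe
wc -l < "$dir/fake.bed" > "$dir/count" &  # the "tool" reads the fifo as a file
printf 'a\nb\nc\n' > "$dir/fake.bed"      # the producer streams the data in
wait
cat "$dir/count"                          # -> 3
```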

written 3.5 years ago by Alex Reynolds

Wow - very cool!

written 3.5 years ago by Yannick Wurm

amazing, upvote

written 2.4 years ago by Christian

In these cases, what I do is write a wrapper script that generates any parameter file needed for running the script.

written 3.5 years ago by Giovanni M Dall'Olio

One solution is to create a file: cat file | parallel --pipe "cat >{#}; my_program {#}; rm {#}". Alex suggests using named pipes - which is more efficient, but does not work with every tool: cat file | parallel --pipe "mkfifo {#}; my_program {#} & cat >{#};rm {#}"

written 3.5 years ago by ole.tange

Hey Ole, how do you escape awk's quotes? Example (counting the reads in fastq files):

parallel 'echo && gunzip -c | wc -l | awk \'{print $1/4}\'' ::: *fastq.gz won't work

written 3.5 years ago by Sukhdeep Singh

Hi Sukhdeep, the following worked for me: parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.gz
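
The \$1 escape is needed because, inside the double quotes, the shell would otherwise expand $1 before awk ever sees it. The arithmetic itself is easy to sanity-check without parallel (toy FASTQ made up for the demo; one read = four lines):

```shell
printf '@r1\nACGT\n+\nIIII\n' > /tmp/demo.fastq   # one read
gzip -f /tmp/demo.fastq
reads=$(gunzip -c /tmp/demo.fastq.gz | wc -l | awk '{d=$1; print d/4}')
echo "$reads"    # -> 1
```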

written 3.5 years ago by Alex Reynolds

Super Alex, it works :)

written 3.5 years ago by Sukhdeep Singh

Use --fifo:

cat file | parallel --fifo --pipe wc {}

Or --cat:

cat file | parallel --cat --pipe wc {}
written 15 months ago by ole.tange

How do we put a wait command between different parallel runs? I have a script that performs multiple jobs in order.

parallel jobs 1

wait

parallel jobs 2

...etc

Will that work ?

written 20 months ago by Goutham Atla

why do you need to call wait ?

written 20 months ago by Pierre Lindenbaum

I am thinking that everything starts in parallel. I have to wait until jobs 1 finish and then start jobs 2.

written 20 months ago by Goutham Atla

use GNU make with option -j

written 20 months ago by Pierre Lindenbaum

Maybe explore GNU Parallel's semaphore options: http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-mutex-and-counting-semaphore

(Though a make-based process as Pierre suggests, or dedicated job scheduler is probably going to be easier to maintain.)

written 20 months ago by Alex Reynolds

It will work, but there is really no need to call wait. GNU Parallel does that automatically. Try:

parallel 'sleep {};echo Jobslot {%} slept {} seconds' ::: 4 3 2 1
seq 5 -.1 0 | parallel 'sleep {};echo Jobslot {%} slept {} seconds'
seq 5 -.1 0 | parallel -j0 'sleep {};echo Jobslot {%} slept {} seconds'

You will see that GNU Parallel only finishes after the last job is done.

written 20 months ago by ole.tange

This is just wonderful. As you mentioned above, GNU Parallel can parallelize your own bash/python/perl scripts, which can take multiple IDs (i.e. arguments) in a single go. Does it work the other way around, i.e. taking a single argument and running it on multiple cores of the computer?

written 2.5 years ago by tarakaramji

How would you run a single argument on multiple cores?

written 2.5 years ago by ole.tange

What options are available if you want to utilize other machines but they require a password for ssh? Is there a way to force using rsh?

written 21 months ago by salamayg2

Define a key with ssh-keygen? http://rcsg-gsir.imsb-dsgi.nrc-cnrc.gc.ca/documents/internet/node31.html

written 21 months ago by Pierre Lindenbaum

Set up RSA keys for password-less SSH.

written 21 months ago by Alex Reynolds

Have a look at: https://wiki.dna.ku.dk/dokuwiki/doku.php?id=ssh_config

written 21 months ago by ole.tange
Pierre Lindenbaum wrote (3.0 years ago):

I put my notebook about GNU parallel on figshare:

http://figshare.com/articles/GNU_parallel_for_Bioinformatics_my_notebook/822138

My document follows Ole Tange's GNU parallel tutorial (http://www.gnu.org/software/parallel/parallel_tutorial.html), but I tried to use some bioinformatics-related examples (align with BWA, Samtools, etc.).

written 3.0 years ago by Pierre Lindenbaum

Thanks so much for sharing this, it's really useful.  I notice that you don't include an example for sorting bam files using parallel.  I'm trying that now:

sort_index_bam(){
    outfile=`echo $1 | sed -e 's/.bam/_sorted/g'`
    samtools sort $1 $outfile
    index_file="$outfile.bam"
    samtools index $outfile
}
export -f sort_index_bam
parallel -j10 --progress --xapply sort_index_bam ::: `ls -1 *.bam`

And get the error (for example)

[E::hts_open] fail to open file 'HiC_LY1_1_NoIndex_L003_034_sorted.bam'
local:10/33/100%/42.8s [bam_sort_core] merging from 3 files...

Perhaps it's something to do with how parallel schedules its worker threads?  The same scripted commands work fine on the command line in serial.  I'm wondering if you have tried something similar.

written 16 months ago by Lee Zamparo

Try this (requires parallel 20140822 or later):

sort_index_bam(){
    samtools sort "$1" "$2"
    samtools index "$2"
}
export -f sort_index_bam
parallel --bar sort_index_bam {} '{=s/.bam/_sorted.bam/=}' ::: *.bam

written 16 months ago by ole.tange

This is awesome!

written 3.0 years ago by Manu Prestat

Thank you Pierre! Do you think it would be possible to have a version with bigger fonts? At 125%, it is difficult to read.

written 3.0 years ago by Eric Normandeau
lh3 wrote (3.6 years ago):

All the clusters I use require SGE/LSF. My understanding is that parallel does not support SGE/LSF (correct me if I am wrong). I would recommend a general way to compose multiple command lines:

seq 22 | xargs -i echo samtools cmd -r chr{} aln.bam | parallel -j 5
ls *.bed | sed s,.bed,, | xargs -i echo mv {}.bed {}.gff | sh
ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo psmc {}.psmcfa \> {}.psmc \& | sh
ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo bsub 'psmc -o {}.psmc {}.psmcfa' | sh

For the last command line to submit jobs to LSF, I more often use my asub script:

ls *.psmcfa | sed s,.psmcfa,, | xargs -i echo psmc -o {}.psmc {}.psmcfa | asub

You can see from the above examples that the general pattern is to feed a list to xargs -i echo to let it print the commands to stdout. The last command after the pipe can be sh if you know they run fast, parallel if you want to control the number of concurrent jobs on the same machine, or asub if you want to submit to LSF/SGE. There are also a few variants: e.g. put & or bsub in the echo command. With xargs -i echo, you are not bound to the parallel grammar. Another advantage is that for complex command lines you can pipe the output to more to visually check whether the commands are correct. At least for me, I frequently catch problems in more before submitting thousands of jobs.
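
The pattern is easy to try with coreutils alone (sketched here with the non-deprecated -I{} spelling of -i):

```shell
# Build the command lines with xargs, eyeball them, then pipe to sh to run them.
cmds=$(seq 3 | xargs -I{} echo echo chr{} processed)
echo "$cmds"          # the dry run you would pipe to 'more'
echo "$cmds" | sh     # actually run: prints "chr1 processed" .. "chr3 processed"
```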

written 3.6 years ago by lh3

Here are your examples using GNU Parallel:

seq 22 | parallel -j5 samtools cmd -r chr{} aln.bam
ls *.bed | parallel mv {} {.}.gff
ls *.psmcfa | parallel psmc {} \> {.}.psmc
ls *.psmcfa | parallel bsub psmc -o {.}.psmc {}
ls *.psmcfa | parallel echo psmc -o {.}.psmc {} | asub

It is shorter and IMHO easier to read.

You can use --dry-run if you want to see what would be done.

ls *.psmcfa | parallel --dry-run psmc {} \> {.}.psmc
written 3.6 years ago by ole.tange

I have known the basics of parallel for some time. My problem is that it is not a standard tool: none of the machines I use at multiple institutes has it, while sed/awk/perl/xargs exist in every Unix distribution. My point is to learn to construct command lines in general. There may also be more complicated cases you cannot handle with one parallel invocation.

written 3.6 years ago by lh3

As long as you are allowed to run your own scripts you can run GNU Parallel. 10 seconds installation: 'wget -O - pi.dk/3 | bash'. Read Minimal installation in http://git.savannah.gnu.org/cgit/parallel.git/tree/README

The examples you provide deal badly with spaces in the filenames. Using GNU Parallel this is no longer an issue.
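
The whitespace problem is easy to reproduce with coreutils alone (find -print0 | xargs -0 is the classic space-safe alternative if you want to stay with xargs):

```shell
#!/usr/bin/env bash
dir=$(mktemp -d) && cd "$dir"
touch "a b.bed"                                   # one file, space in the name
ls *.bed | xargs -n1 echo got | wc -l             # -> 2: the name was split
find . -name '*.bed' -print0 | xargs -0 -n1 echo got | wc -l   # -> 1: safe
```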

written 3.6 years ago by ole.tange

These are both good points, and there are good arguments for both cases. On one hand, you don't want to clutter your pipelines with tools and binaries that are not minimally standard. On the other hand, good tools eventually become a standard (this tool may be an example). Somewhat related, I think this project (shameless plug): https://github.com/drio/bio.brew can help mitigate the management of software tools. It is especially useful when you don't have root access to the boxes where you do your analysis.

written 2.5 years ago by Drio
Alex Reynolds wrote (3.6 years ago):

EXAMPLE: Using multiple SSH-capable hosts to efficiently generate a highly-compressed BED archive

For labs without an SGE installation but with lots of quiet hosts running an SSH service and the BEDOPS tools, we can use GNU Parallel to quickly generate per-chromosome, highly-compressed archives of an input BED file, stitching them together at the end into one complete Starch archive.

This archival process can reduce a four-column BED file to about 5% of its original size, while preserving the ability to do memory-efficient, high-performance and powerful multi-set and statistical operations with bedops, bedmap, bedextract and closest-features:

$ PARALLEL_HOSTS=foo,bar,baz
$ bedextract --list-chr input.bed \
    | parallel \
        --sshlogin $PARALLEL_HOSTS \
        "bedextract {} input.bed | starch - > input.{}.starch"
$ starchcat input.*.starch > input.starch
$ rm input.*.starch

Once the archive is made, it can be operated on directly, just like a BED file, e.g.:

$ bedops --chrom chrN --element-of -1 input.starch another.bed
...

We have posted a GNU Parallel-based variant of our starchcluster script, part of the BEDOPS toolkit, which uses some of the above code to let multiple hosts efficiently turn BED inputs into highly-compressed, operable Starch archives.

written 3.6 years ago by Alex Reynolds

I am trying to figure out why you need --max-lines 1 --keep-order. Would it not work just as well without? Also, why not use GNU Parallel's automation to figure out the number of cores instead of forcing $MAX_JOBS in parallel? (And you are missing a | before parallel.)

written 3.6 years ago by ole.tange

You're right: assuming that defaults do not change, it isn't necessary to be explicit. Thanks for the pipe catch.

written 3.6 years ago by Alex Reynolds

Is it possible to make parallel use rsh in the case where ssh requires a password?

written 21 months ago by salamayg2

Not sure, but you can certainly use RSA keys to SSH across hosts without a password. See: http://archive.news.softpedia.com/news/How-to-Use-RSA-Key-for-SSH-Authentication-38599.shtml

written 21 months ago by Alex Reynolds

Yes. Normally a --sshlogin is simply a host name:

server1

If you want to use another "ssh-command" than ssh, then you prepend with full path to the command:

/usr/bin/rsh server1

Making it look like this:

parallel -S "/usr/bin/rsh server1" do_stuff

But if possible it is a much better idea to use SSH with SSH-agent to avoid typing the passwords: https://wiki.dna.ku.dk/dokuwiki/doku.php?id=ssh_config

written 21 months ago by ole.tange
Giovanni M Dall'Olio wrote (3.6 years ago):

EXAMPLE: Coalescent simulations using COSI

This script can be used to launch multiple cosi simulations using the GNU Parallel tool, or in a Sun Grid Engine environment.

This is how I launch it:

$ seq 1 100 | parallel ./launch_single_cosi_iteration.sh {} outputfolder

This is a home-made script that I wrote to execute a one-time task. I've tried to adapt it for the general case and to improve the documentation, but I didn't spend much time on it. Please ask me if it doesn't work or if you have any doubts.


EXAMPLE: convert all PED files in a directory to BED

This is an example of how GNU Parallel can be used in combination with plink (or vcftools) to execute tasks on a set of data files. Note how "{.}" takes the value of the file name without the extension.

find . -maxdepth 1 -type f -iname "*ped" | parallel "plink --make-bed --noweb --file {.} --out {.}"
written 3.6 years ago by Giovanni M Dall'Olio

Since you do not use UNIX special chars (such as | * > &) in your command the " are not needed.

written 3.6 years ago by ole.tange

thank you, I didn't know that :-)

written 3.6 years ago by Giovanni M Dall'Olio
brentp wrote (2.1 years ago):

Practical variant calling example

Get sequence names from the FASTA to parallelize by chromosome - not perfect, but works well in practice:

seqnames=$(grep ">" $REF | awk '{ print substr($1, 2, length($1)) }')
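
The awk call just strips the leading '>' and keeps only the first whitespace-delimited word of each header line. On a toy FASTA (made up for the demo):

```shell
printf '>chr1 assembled\nACGT\n>chr2\nTTTT\n' > /tmp/toy.fa
grep ">" /tmp/toy.fa | awk '{ print substr($1, 2, length($1)) }'
# prints:
# chr1
# chr2
```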

Run samtools:

parallel --keep-order --max-procs 11 "samtools mpileup -Euf $REF -r {} $BAM \
   | bcftools view -v -" ::: $seqnames \
   | vcffirstheader \
   | vt normalize -r $REF - > $VCF

 

where vcffirstheader is from vcflib and vt normalize is from https://github.com/atks/vt

Same for freebayes:


parallel --keep-order --max-procs 11 "freebayes --fasta-reference $REF \
    --genotype-qualities --experimental-gls \
    --region {} $BAM  " ::: $seqnames \
    | vcffirstheader \
    | vt normalize -r $REF - > $VCF
written 2.1 years ago by brentp

When your parallel command spans multiple screen lines it is time to consider using a bash function instead:

my_freebayes() {
  freebayes --fasta-reference $REF --genotype-qualities --experimental-gls --region "$1" $BAM
}
export -f my_freebayes

parallel --keep-order --max-procs 11 my_freebayes ::: $seqnames \
    | vcffirstheader \
    | vt normalize -r $REF - > $VCF

But it is purely a matter of taste.

written 2.1 years ago by ole.tange
madismetsis wrote (2.5 years ago):

First, GNU Parallel has the best installability of anything I have seen in my life. And all the examples worked - gunzip, blast etc. However, I got stuck on one thing. I have a simple perl script called transparal.pl. It does something to a file that is provided as an argument. The original one behaves as expected after chmod 700:

$ ./transparal_old.pl
Could not open file! at ./transparal_old.pl line 5. (I did not give the input file name!)

Then I changed the shebang to

#!/usr/bin/parallel --shebang-wrap perl

and ...

$ ./transparal.pl 
-bash: ./transparal.pl: /usr/bin/parallel: bad interpreter: No such file or directory

Checking GNU Parallel:

$ which parallel
/usr/local/bin/parallel

Looks OK. Confused.

written 2.5 years ago by madismetsis

Look at #!/usr/bin/parallel vs. where the binary really is: /usr/local/bin/parallel. You could also try #!/usr/bin/env parallel instead; then it will just take parallel from your PATH, which is possibly the (mostly) portable way. However, I am not sure the /usr/bin/env way handles parameters to the program. Edit: see http://stackoverflow.com/questions/4303128/how-to-use-multiple-arguments-with-a-shebang-i-e which in conclusion means that on many systems you can only use the correct absolute path (and #!/usr/bin/env parallel with arguments to the program most likely will not work).

written 2.5 years ago by Michael Dondrup

I wonder if recent builds of the binary expect it to be in /usr/bin. Still trying to troubleshoot a similar problem.

written 2.4 years ago by Alex Reynolds

I have built my own parallel and installed it in $HOME/bin, which is in my PATH; that worked fine for me.

written 2.4 years ago by Michael Dondrup
Yaseen Ladak wrote (10 months ago):

Hi all, I need some help running my two samples through freebayes using parallel. I saw this in a previous post but got confused:

parallel --keep-order --max-procs 11 "freebayes --fasta-reference $REF \

    --genotype-qualities --experimental-gls \
    --region {} $BAM  " ::: $seqnames \
    | vcffirstheader \
    | vt normalize -r $REF - > $VCF

I am a bit confused: I have two BAM files that need to be run with freebayes. In the first run I don't want to use vcffirstheader and vt normalize, and in the second run I do.

Let's say the files are S1.bam and S2.bam and the reference is hg38.fa. Also, do I need

--region {} $BAM ?

parallel --keep-order --max-procs 0  "freebayes --fasta-reference hg38.fa  " ::: S1.bam S2.bam > output_1.vcf

With vcffirstheader and vt normalize, would it look like:

parallel --keep-order --max-procs 11 "freebayes --fasta-reference hg38.fa " ::: S1.bam S2.bam | vcffirstheader | vt normalize -r hg38.fa - > output_2.vcf

Can someone please help me?

Thanks

written 10 months ago by Yaseen Ladak
5heikki wrote (19 months ago):

What is the best way to deal with the error below?

parallel "do something" ::: seq.*
-bash: /usr/local/bin/parallel: Argument list too long

Alternatively, if somebody could show me how to pipe to a command defined in a bash script, that would be just wonderful. Right now, I'm doing:

#split the multifasta into individual seqs
cat $NAME/file.fna | parallel --recstart '>' -N1 --pipe "cat - > $NAME/seq.{#}"
#do stuff with the split files
export -f blastFunction
parallel blastFunction ::: $NAME/seq.*

and blastFunction begins like this:

blastFunction() {
        BLAST=$(blastn -query $1 -subject $1 -outfmt 6 -perc_identity 100)

written 19 months ago by 5heikki

Try giving parallel a text file that lists the input files, instead of direct arguments. You can do this via 4 colons (::::).

Get a list of files by:

ls *.txt > myFiles

Then do:

parallel "do something" :::: myFiles
written 19 months ago by Damian Kao

Thanks for the suggestion, but

-bash: /bin/ls: Argument list too long
written 19 months ago by 5heikki

Yes, if the argument list is too long for bash, it won't work with any command where you let the shell glob the file names.

written 19 months ago by Michael Dondrup

The example "EXAMPLE: convert all PED files in a directory to BED" above should work for this, using find; you seem to have too many txt files.

written 19 months ago by Michael Dondrup

Thanks, this one worked

find $NAME/ -type f -maxdepth 1 -iname "seq.*" | parallel blastFunction

Of course, the optimal solution would be if I could skip creating thousands and thousands of files altogether.

written 19 months ago by 5heikki

you might try the >> operator next time you create files

written 19 months ago by Michael Dondrup

Alejandro Jimenez Sanchez wrote (6 months ago):

Hi, I have a burning question.

I want to run a script named "predict_binding.py". Its syntax is:

./predict_binding.py [argA] [argB] [argC] ./file.txt

file.txt has a column of strings with the same length:

string_1 
string_2 
string_3
...
string_n

predict_binding.py works with the first 3 arguments and string_1, then the 3 arguments and string_2, and so on.

That's fine, but now I have m values of argB, and I want to test all of them. I want to use the cluster for this, and this looks like a perfect job for parallel, doesn't it?

After reading the manual and spending hours trying to make it work, I realised I need some help.

What works so far (and is trivial) is:

parallel --verbose ./predict_binding ::: argA ::: argBi ::: argC ::: ./file.txt

This gives the same result as:

./predict_binding.py argA argBi argC ./file.txt

And indeed the flag --verbose says that the command looks like

./predict_binding.py argA argBi argC ./file.txt

But I want to test all the argB values, so I made a file called args.txt, which looks like this:

argA argB1 argC ./file.txt
argA argB2 argC ./file.txt
...
argA argBm argC ./file.txt

If I do:

cat args.txt | parallel --verbose ./predict_binding.py {}

I get an error from ./predict_binding saying:

predict_binding.py: error: incorrect number of arguments

And verbose says that the command looks like: ./predict_binding.py argA\ argBi\ argC\ ./file.txt

So, maybe those backslashes are affecting the input of ./predict_binding? How could I avoid them?

I have tried using double and single quotation marks " ', backslash \, and backslash with single quote \' - none has worked!

I also tried:

cat ./args.txt | parallel --verbose echo | ./predict_binding

Same error as above.

And also I tried to use a function like:

binding_func() { ./predict_binding argA $1 argC ./file.txt; }

Interestingly, binding_func works for:

parallel binding_func ::: argB1

But if I do:

parallel binding_func ::: argB1 argB2

It gives the result for one arg but fails (same error as above) for the other.

If I put only argB1 in the args.txt file and do:

cat args.txt | parallel --verbose binding_func {}

It fails miserably with the same error: predict_binding.py: error: incorrect number of arguments

It seems a very trivial and easy problem but I haven't been able to solve it }:(

I would appreciate very much any help provided. :)

written 6 months ago by Alejandro Jimenez Sanchez

./predict_binding.py argA argB argC :::: ./file.txt

written 4 months ago by ole.tange