Question

bash program with two subprograms. How to use getopts in all of them?

0

18 months ago

arturo.marin ▴ 20

I have a program that we will call A, and two other programs that we will call B and C. Program A is a BASH script that simply runs the other two and displays help options. B and C are BASH programs that execute other programs written in the C language. The basic code of program A is:

while getopts "hv" option; do
   case ${option} in
     h) # Call the Help function
        Help
        exit;;
     v) # Call the Version function
        Version
        exit;;
     \?) # Invalid option
         echo "Invalid option. Use flag -h for help."
         exit;;
   esac
done

if [ "$1" = "B" ] ; then
   source <path_to>/B.sh
fi

if [ "$1" = "C" ] ; then
   source <path_to>/C.sh
fi

My question is if it is possible to use getopts for specific options when executing B and for other specific options when executing C, in both cases when using A as the main program. For example, if -t is the number of threads we want the programs called by script B to run with:

A B -t 10

The getopts of program B is something like this:

while getopts t:d: flag
do
   case "${flag}" in
       t) NCPUS=${OPTARG};; # Number of CPUs used by the software being run
       d) DATA_PATH=${OPTARG};; # data path
   esac
done

I remember that there is some bioinformatics program that does this in BASH... maybe I could see how it does it in the code, but I don't remember the name. Any suggestion?
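
For what it's worth, the subcommand pattern asked about here can be done in plain bash. A minimal sketch, assuming B and C are wrapped as functions: `main`, `run_B` and `run_C` are hypothetical names, and the `-o` option for C, the default values and the version string are made up for illustration. A parses its own options, shifts them and the subcommand name away, and hands the remaining arguments to the subcommand, which resets `OPTIND` before calling getopts again.

```bash
#!/usr/bin/env bash
# Sketch of A: global options first, then a subcommand with its own options.

run_B() {
    # A local OPTIND set to 1 makes this getopts start at the first argument.
    local OPTIND=1 flag
    NCPUS=1 DATA_PATH=.                 # hypothetical defaults
    while getopts "t:d:" flag; do
        case "${flag}" in
            t) NCPUS=${OPTARG} ;;
            d) DATA_PATH=${OPTARG} ;;
            *) echo "Usage: A B [-t threads] [-d data_path]" >&2; return 1 ;;
        esac
    done
    echo "B: threads=${NCPUS} data=${DATA_PATH}"
}

run_C() {
    local OPTIND=1 flag
    OUT=out                             # hypothetical option for C
    while getopts "o:" flag; do
        case "${flag}" in
            o) OUT=${OPTARG} ;;
            *) echo "Usage: A C [-o output]" >&2; return 1 ;;
        esac
    done
    echo "C: output=${OUT}"
}

main() {
    local OPTIND=1 option subcommand
    while getopts "hv" option; do
        case "${option}" in
            h) echo "Usage: A [-h] [-v] {B|C} [options]"; return 0 ;;
            v) echo "A version 0.1"; return 0 ;;
            *) echo "Invalid option. Use flag -h for help." >&2; return 1 ;;
        esac
    done
    shift $((OPTIND - 1))               # drop A's own options
    subcommand=${1:-}
    if [ $# -gt 0 ]; then shift; fi     # drop the subcommand name itself
    case "${subcommand}" in
        B) run_B "$@" ;;
        C) run_C "$@" ;;
        *) echo "Unknown subcommand. Use flag -h for help." >&2; return 1 ;;
    esac
}

# In a real script the last line would be: main "$@"
```

With this layout, `A B -t 10` ends up calling `run_B -t 10`. The same idea should carry over to the `source <path_to>/B.sh` version: `shift` past the subcommand name and set `OPTIND=1` before the `source` line, since a sourced script sees the caller's remaining positional parameters.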

bash pipeline • 2.2k views

ADD COMMENT • link 18 months ago by arturo.marin ▴ 20

0

Bioinformatics program? Pipeline? Maybe you're thinking of Nextflow. You can use Nextflow to manage your pipeline and the bash part can just be copy-pasted to the script block inside a process (a task definition in Nextflow lingo).

ADD REPLY • link 18 months ago by Marcel Ribeiro-Dantas ▴ 470

0

18 months ago

arturo.marin ▴ 20

Thank you both very much. A few years ago I programmed a Gillespie simulation (SSA) in C for my thesis (in the framework of quasispecies and theoretical biology), and over the last years I have compiled many programs in C and C++. I have some experience in bash, Python, MATLAB and C, and more than 10 years of experience in bioinformatics (including an MSc in Computational Biology and Bioinformatics), yet I would never have thought that makefiles could be used to build pipelines, let alone for what I want. The only question I had was how to hide the make name when executing the pipeline I am writing, since my intention is to publish it. This can be done easily. If, for example, we want to call the pipeline mypipeline, a bash script called mypipeline with the following code seems to be enough:

#!/bin/sh

make "$@"

I am going to port my bash code to a makefile. It should be very simple, except for some loops, which should be easy to solve as indicated by Istvan.
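
One caveat with the two-line wrapper: make looks for the Makefile in the current directory, so it only works when run from the pipeline folder. A sketch of a variant that points make at a Makefile sitting next to the wrapper script, via `make -f`. The assumption that the two files live side by side is mine, and `MAKE_BIN` is a hypothetical hook that just makes the function easy to try without running make.

```bash
#!/bin/sh
# Hypothetical wrapper "mypipeline", assumed to live next to its Makefile.

mypipeline() {
    # Absolute path of the directory that holds this script.
    here=$(cd -- "$(dirname -- "$0")" && pwd)
    # MAKE_BIN is an illustrative override; by default plain make is used.
    "${MAKE_BIN:-make}" -f "${here}/Makefile" "$@"
}

# In a real script the last line would be: mypipeline "$@"
```

Targets and variables are still passed straight through, for example `mypipeline align REF=reference.fa`.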

ADD COMMENT • link 18 months ago by arturo.marin ▴ 20

score 3 · Accepted Answer · 2022-10-19

3

18 months ago

Istvan Albert 100k

The first step in automation is the Makefile.

Learn to use them and you can leave behind the cringe- and frustration-filled world of bash programming. Makefiles will prepare you for the next levels of automation.

Most people do not realize how easy it is to pass parameters into Makefiles.

Here is a simple introduction

https://makefiletutorial.com/

Makefile example:

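# Default value; override it on the command line with NAME=...
NAME = Joe

# Recipe lines must start with a tab. Write ${NAME}, not $NAME:
# make reads $NAME as the variable N followed by the text AME.
hello:
	echo Hello ${NAME}

bye:
	echo Goodbye ${NAME}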

use it like so:

make hello NAME=Jane
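
For comparison, a rough plain-bash counterpart of that Makefile, as a sketch only (the function names are hypothetical). The default has to be spelled out with `${NAME:-Joe}` at each use, whereas make takes `NAME=Jane` on the command line and overrides the assignment in the file without any extra code.

```bash
# NAME falls back to "Joe" unless the caller sets it.
hello() { echo "Hello ${NAME:-Joe}"; }
bye()   { echo "Goodbye ${NAME:-Joe}"; }
```

Here the override goes in front of the command: `NAME=Jane hello`.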

Read also the blog post: Your Makefiles are wrong:

https://tech.davis-hansson.com/p/make/

ADD COMMENT • link 18 months ago by Istvan Albert 100k

0

Granted, I've never written a Makefile, but if one is new to the pipeline/automation business in bioinformatics I'd suggest going straight to snakemake instead of make. I think snakemake is easier to learn and you get more mileage.

Having said that, and also with respect to @Marcel Ribeiro-Dantas' comment, my understanding is that the question is about formatting the help in a way similar to argparse subcommands in Python rather than about pipelines. Although ultimately it may well be about pipelining...

ADD REPLY • link 18 months ago by dariober 14k

3

I strongly disagree, and I think snakemake and nextflow are overly complex software that are NOT suited as an introduction to bioinformatics.

One should only learn snakemake and nextflow when they feel their regular approaches limit them. Otherwise, all they will do is fight the system itself instead of learning bioinformatics.

Take the following makefile; look how short, simple and explicit it all is.

REF = genome.fa
R1 = fastq/read1.fq
BAM = alignment.bam

index:
    bwa index ${REF}

align:
    bwa mem ${REF} ${R1} | samtools sort > ${BAM}
    samtools index ${BAM}

this makefile can be used to align any single-end data by just invoking it as:

 make index REF=reference.fa

to index the reference, then to align any data just do:

 make align REF=reference.fa  R1=foo.fq BAM=foo.bam

 make align REF=reference.fa  R1=bar.fq BAM=bar.bam

That's it.

Note that no argument parsing is needed, no special novel configuration language to learn, and no documentation to scour on how to use this or that.

Makefiles can do a lot more of course, one could make it into a "proper" makefile with full dependencies.

But that is also the beauty of Makefiles - we don't have to add anything fancy at the start. One can make wondrously simple and explicit pipelines with it. Makefiles can grow with us.

ADD REPLY • link 18 months ago by Istvan Albert 100k

0

Hi Istvan, it looks like we are going to agree to disagree as I'm not persuaded by your reply.

"snakemake and nextflow are overly complex software that are NOT suited as an introduction to bioinformatics."

Sure, snakemake and nextflow are workflow managers that incidentally have been developed with bioinformatics tasks in mind, but they are not introductions to bioinformatics (neither is R or python, they are tools for bioinformatics, not introductions to). In my previous comment I meant that bioinformatics tasks are usually complex enough that you are better off learning snakemake/nextflow straightaway rather than passing by make.

"Otherwise, all they will do is fight the system itself instead of learning bioinformatics."

(I don't quite follow what you mean here...) make is a safe investment since it's been around for a long time and so much depends on it that it's not going away any time soon. I'm not sure snakemake/nextflow are there yet. Anyway, as I understand it snakemake follows the same philosophy of make (hence the name) so the two don't differ too much, really.

"Note that no argument parsing is needed, no special novel configuration language to learn, and no documentation to scour on how to use this or that."

I would contend that your example works because it is very simple, but Makefiles do have their own syntax and rules to learn. Your example would be fairly intelligible in snakemake too. Once you have paired and single reads, inconsistent file naming, or you need to match controls to cases, submit jobs to a cluster, etc., I think make becomes unmanageable very quickly. I mean, compiling programs is usually a lot easier than bioinformatics tasks.

Very rarely, if ever, have I seen a bioinformatics task written as a Makefile. Usually you see a terrible mish-mash of bash and readme files (I've done a lot of that). Switching to snakemake/nextflow is not easy but it pays off.

ADD REPLY • link 18 months ago by dariober 14k

1

What I was saying is that makefiles are outstanding alternatives to bash scripts.

They are simpler to handle (automatic parameter passing for example), they can maintain several tasks in a single file, they can provide dependency management and re-entrancy.

It is really weird that people are being regularly taught bash but not make.

I was guilty of this as well. For many years I never even considered Makefiles. I thought they were complicated beasts. I only showed my students how to write bash scripts. And above you, yourself, just said you have never written a Makefile. Perhaps we have a problem with bioinformatics education, which is not surprising, as it is in sorry shape for sure.

But then I realized at some point just how extraordinarily simple makefiles are and how incredibly easy it is to write them. They scale down exceedingly well! That is the great thing about them. You don't have to use any dependency management; they are still super useful.

Nowadays, I am refocusing my class to teach my students makefiles. As a result, instead of multiple disjointed scripts, my students (many of them with no prior computational background) become able within weeks to write self-contained, readable analysis workflows that contain complete data analyses and are both explicit (you can see the commands in their complete form) and re-entrant (you can easily run any part of them). I also found it so easy to demonstrate makefiles in lectures - I don't need to send students to read manuals or documentation. Just an example will do.

What is beautiful about make is that it is really very similar to bash - the commands are basically run in bash - so you can learn both at the same time. A command that works in bash can be lifted over verbatim as one line.

I consider both snakemake and nextflow too complicated to teach. I am telling you as someone that has been teaching bioinformatics for a decade now.

I think it would be very difficult to learn bash/unix, snakemake and bioinformatics all at the same time. I am also 100% sure you too learned snakemake after you had learned how to use unix and how to do bioinformatics. Looking back, it may seem to you that you could have started with snakemake, but I very much doubt it.

Long story short, give makefiles a try, and see just how well they replace bash scripts.

ADD REPLY • link 18 months ago by Istvan Albert 100k

0

(Replying to the answer to avoid too much indenting)

For reference, this is my implementation in snakemake of your example (not tested). The Snakefile:

REF = config['REF']
R1 = config['R1']
BAM = config['BAM']

rule all:
    input:
        BAM,

rule index:
    input:
        ref=REF,
    output:
        idx=REF + '.bwt',
    shell:
        "bwa index {input.ref}"

rule align:
    input:
        ref=REF,
        idx=REF + '.bwt',
        r1=R1,
    output:
        bam=BAM,
    shell:
        r"""
        bwa mem {input.ref} {input.r1} | samtools sort > {output.bam}
        samtools index {output.bam}
        """

To index and align:

snakemake -C REF=genome.fa R1=foo.fq BAM=foo.bam -p -n -j 1

-p and -n are optional but very useful: they print the shell commands and run in dry-run mode, respectively. I don't know... It seems quite readable to me, although I don't claim it's easy to figure out from scratch. The biggest hurdle, I suppose, is to think in terms of rules being chained by input/output files in order to produce whatever you put in the first rule. But I think this is also the case for make.

ADD REPLY • link 18 months ago by dariober 14k

1

Right, I think your code demonstrates many of the issues I mentioned: even a trivial program is not so simple. Compare it to the Makefile:

REF=genome.fa
R1=fastq/read1.fq
BAM=alignment.bam

index:
    bwa index ${REF}

align:
    bwa mem ${REF} ${R1} | samtools sort > ${BAM}
    samtools index ${BAM}

Count how many lines longer the code is, then count how many weird little formatting widgets you have to have. It makes sense to you because you know Python already, but someone new to it will make dozens of mistakes. Teaching snakemake is a separate course in itself.

The raw strings, the triple quotes, the commas, the single quotes, the concatenations, the weird template language with {}, and then it is all in yaml, which is actually a super confusing markup: yaml relies on invisible formatting characters and indentation that matters immensely, and troubleshooting a misaligned yaml can be very frustrating.

And then the code is not runnable in bash.

That is the beauty of makefiles: the variables have the same naming format as in bash, so a makefile can be an ultra-thin wrapper over bash.

You can write a bash script and, when you are happy with it, copy-paste it into a Makefile and vice versa.

ADD REPLY • link 18 months ago by Istvan Albert 100k

0

IMHO the fundamental mistake that snakemake made is that they should have stuck with Python all the way, instead of mixing yaml, Python and bash.

Configuration via Python module variables is trivial to do (see how well Django does it).

Introducing yaml into the picture was and is a huge mistake.

ADD REPLY • link 18 months ago by Istvan Albert 100k

0

Then, at any time, if you want to add dependencies, take your working Makefile and add a bit of a twist to it:

REF=genome.fa
R1=fastq/read1.fq
BAM=alignment.bam

# The target must be written ${REF}.bwt so that make expands the variable.
${REF}.bwt:
	bwa index ${REF}

align: ${REF}.bwt
	bwa mem ${REF} ${R1} | samtools sort > ${BAM}
	samtools index ${BAM}

Boom, I have made some tiny changes, and now the makefile has dependency management. The index is automatically created.
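
To make the behaviour concrete: a rule whose target is a file and that lists no prerequisites runs its recipe only when that file does not exist yet, which is what happens for the index here. A rough bash sketch of that check (the function name `build_if_missing` is hypothetical, and real make additionally compares timestamps once prerequisites are listed):

```bash
# Run the given command only if the target file is missing.
build_if_missing() {
    target=$1
    shift
    if [ -e "${target}" ]; then
        echo "up to date: ${target}"
    else
        "$@"
    fi
}
```

For the index that would read `build_if_missing genome.fa.bwt bwa index genome.fa`. The `align` target, by contrast, never creates a file called `align`, so make runs it every time it is requested.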

But it is all optional. It worked before and I did not even need to know beforehand what the index was called. I ran the index, saw what it created then added it.

And that is what I love about it.

For the record, I am kicking myself for not realizing sooner how well-suited Makefiles are for bioinformatics.

ADD REPLY • link 18 months ago by Istvan Albert 100k

0

"I am kicking myself that I did not realize for so long ..."

Well, I remember the author of snakemake posting his project on SeqAnswers and I was like "Why bother?! That looks so complicated, what's wrong with bash for loops!?". It took me a few years to make the leap and there is no looking back, for the same reasons you give. I guess we can agree on this, although from different angles!

ADD REPLY • link 18 months ago by dariober 14k

4

There is a difference, though.

I recommend makefiles because they make bash programming simpler, snakemake does not! They are not similar alternatives at all.

FWIW in my opinion snakemake is a replacement for GNU parallel - it matches patterns and builds commands from that, the dependency management is not its main selling point.

When someone recommends snakemake they are replacing the bash for loops with what they think is a simpler construct. That is also what you recall as a motivation for learning it - because that's how it goes.

But, as it turns out, GNU parallel is the proper replacement for bash for loops, not makefiles nor snakemake. As a matter of fact, the worst parts of makefiles are the pattern-matching rules that create implicit loops, things like:

$(filter %.o,$(obj_files)): %.o: %.c

I am not quite sure what the above does, but, as it turns out, it is unnecessary to know them for bioinformatics. Perhaps bioinformaticians don't learn makefiles because of seeing traumatic patterns like the one above.

Snakemake tries to fix the makefile's implicit looping constructs by inventing a new templating language, but it ends up fixing a problem that did not need fixing. Just use parallel instead, it is a tool that pays dividends in so many other ways.

Bioinformatics loops should look like this:

cat samples.txt | parallel "bwa mem ${REF} reads/{}.fq | samtools sort > {}.bam"

No for loop needed. The above code can be run and tested in bash and then added to the makefile. I find the above a beautiful and elegant construct. Now look at the new makefile that I have easily built from the old one, with just a few words added:

SAMPLES=samples.txt
REF=genome.fa

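${REF}.bwt:
	bwa index ${REF}

# {} is parallel's placeholder (no $ in front); the quotes keep the pipe
# and the redirect inside each parallel job.
align: ${REF}.bwt
	cat ${SAMPLES} | parallel "bwa mem ${REF} {}_r1.fq {}_r2.fq | samtools sort > {}.bam"
	cat ${SAMPLES} | parallel samtools index {}.bam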

I can run it on any sample:

 make align SAMPLES=samples.txt

Now this makefile can process any number of samples. If you were to write the same thing in snakemake, you would see that the patterns have to be written in several places and have to match across them. Here every pattern is on the same line. Huge win for readability.
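
A handy sanity check before launching anything is to expand the template by hand and read the commands. A plain-bash sketch, assuming one sample name per line in the samples file and the `_r1.fq`/`_r2.fq` naming used in the makefile (the function name `print_align_cmds` is hypothetical); GNU parallel also has a `--dry-run` option for the same purpose.

```bash
# Print the command each sample would get, without running anything.
print_align_cmds() {
    ref=$1
    samples=$2
    # The "|| [ -n ... ]" part keeps a last line that lacks a newline.
    while IFS= read -r sample || [ -n "${sample}" ]; do
        [ -n "${sample}" ] || continue      # skip blank lines
        echo "bwa mem ${ref} ${sample}_r1.fq ${sample}_r2.fq | samtools sort > ${sample}.bam"
    done < "${samples}"
}
```

Calling `print_align_cmds genome.fa samples.txt` lists one alignment command per sample, which makes a misplaced placeholder or a wrong file suffix easy to spot.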

As the workflow gets more complicated, a makefile's readability becomes more apparent.

I am telling you, I have come to believe that the overwhelming majority of people who recommend snakemake are not fully aware that it is simpler to do the same thing as a Makefile plus parallel.

I know this because I was the same way - I always thought of Makefiles as something with weird, abstract, barely comprehensible pattern-matching rules. But that is only because we were shown makefiles written by programmers who were only compiling code, had been doing it for decades, and went gaga with implicit looping patterns.

Makefiles can also be extraordinarily simple and explicit, more readable and comprehensible than any other approach.

ADD REPLY • link 18 months ago by Istvan Albert 100k