Question

How can I learn how to create a bioinformatics pipeline?

5

2.0 years ago

Jean ▴ 60

How (and where) can I learn to create a bioinformatics pipeline? Are there any books or other resources that could guide me?

Thanks a lot :D

pipeline • 7.2k views

Updated 20 months ago by Istvan Albert 101k • written 2.0 years ago by Jean ▴ 60

1

While a pipeline can mean different things, you may want to learn a workflow manager instead. Here is Nextflow: https://www.nextflow.io/docs/latest/index.html

You could also use simple makefiles to great effect, as demonstrated by @Istvan in the post "bash program with two subprograms. How to use getopts in all of them?"

2.0 years ago by GenoMax 147k

0

See my answer to "Generating shell scripts".

2.0 years ago by Pierre Lindenbaum 164k

score 16 · Answer 1 · 2022-10-25

The most important thing to consider is your level of expertise in bioinformatics.

As an educator with quite a bit of experience teaching bioinformatics, I believe that there are huge challenges when learning either Nextflow or Snakemake. These systems were developed for different needs than a typical life scientist has.

These tools are built by experts and for experts who know the commands and methods like the back of their hands. They have been doing bioinformatics for a long time and intimately understand the various interconnected datatypes, parameters, and ancillary information. Their problem is that they cannot scale their work up to hundreds of tasks with a flick of the wrist.

For a beginner, both Nextflow and Snakemake present a steep learning challenge, as all their features, complexities, and abstractions are added on top of and in addition to doing bioinformatics.

GenoMax has already linked to a post I recently made here, "bash program with two subprograms. How to use getopts in all of them?", where I show how good old makefiles are simpler and more readable than either of these approaches.

I strongly suspect that none of the people that recommend that you start with a full-grown pipeline engine have themselves learned to write pipelines that way. Most likely all of them started with the process I describe below. Thus I think you should do the same:

1. Learn to write bash scripts.
2. Add the bioinformatics code you run into these bash scripts.
3. Once your bash scripts work, learn to refactor them to make them reusable. For example, move hardcoded paths into variables that can be easily edited.
4. Learn to write documentation that explains in detail what each step does, why each parameter was chosen a certain way, and how your scripts should be used.
5. Now keep running your scripts and start to note the pain points: what is not sufficiently automated? What is overly automated and keeps causing trouble?
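A minimal sketch of what the first three steps might look like for a single alignment step. It assumes bwa and samtools are installed and that the reference has already been indexed with bwa index; the script name align.sh, the directory layout, and the function name align_sample are hypothetical placeholders, not taken from any of the linked posts.

```bash
#!/usr/bin/env bash
# Stop on errors, on unset variables, and on failures inside a pipe.
set -euo pipefail

# Settings live at the top; each can be overridden from the environment,
# e.g.  THREADS=8 bash align.sh S1
# All paths below are hypothetical placeholders.
REF="${REF:-refs/genome.fa}"
READ_DIR="${READ_DIR:-reads}"
BAM_DIR="${BAM_DIR:-bam}"
THREADS="${THREADS:-4}"

# Align one sample, write a sorted and indexed BAM file,
# and print the path of the result.
# Assumes the reference was already indexed with `bwa index`.
align_sample() {
    local sample="$1"
    local reads="${READ_DIR}/${sample}.fq"
    local bam="${BAM_DIR}/${sample}.bam"

    mkdir -p "$BAM_DIR"
    bwa mem -t "$THREADS" "$REF" "$reads" | samtools sort -o "$bam" -
    samtools index "$bam"
    echo "$bam"
}

# Run only when a sample name is given on the command line.
if [[ $# -gt 0 ]]; then
    align_sample "$1"
fi
```

Saved as align.sh, this would be run as `bash align.sh S1`. Because every path is a variable with a default, a different reference or output directory needs no edit to the script itself.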

Once you have the above and your scripts run fine, you've already learned how to write a bioinformatics pipeline. Now learn to write a simple Makefile.

Once you have your scripts and/or your makefile, if you still feel that it is not automated enough, then pick Snakemake or Nextflow.
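One thing a Makefile adds over a plain script is that a step is skipped when its output is already newer than its inputs. A rough sketch of that idea in plain bash, only to make clear what make would be doing for you; the function names needs_update and count_lines and the line-counting step are made up for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Succeed (return 0) when the target must be rebuilt: it is missing,
# or at least one of the inputs is newer than it.
needs_update() {
    local target="$1"; shift
    [[ -e "$target" ]] || return 0
    local input
    for input in "$@"; do
        [[ "$input" -nt "$target" ]] && return 0
    done
    return 1
}

# Hypothetical step: store the number of lines of a file.
# The work is done only when the output is missing or stale.
count_lines() {
    local input="$1" output="$2"
    if needs_update "$output" "$input"; then
        wc -l < "$input" > "$output"
        echo "built $output"
    else
        echo "up to date: $output"
    fi
}
```

Calling `count_lines reads.txt reads.count` twice in a row does the work the first time and reports "up to date" the second time, as long as reads.txt has not changed in between. A Makefile expresses the same relationship declaratively, as a target with its prerequisites.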

score 4 · Answer 2 · 2022-10-25

You can use some of these pointers as aspirational goals for learning beyond the excellent starting steps that Istvan Albert suggests. In line with what Istvan is stressing, you'll note they are all components in a broader context of learning how to get science done with computation as a tool:

Titus Brown has some intro material that plugs in learning Snakemake using a fairly standard bioinformatics workflow:

snakemake for doing bioinformatics - a beginner's guide (part 1), Jan 2023 and part 2, Jan 2023
Intro to workflows for efficient automated data analysis, using snakemake, 2019. From the announcement by Titus Brown (@ctitusbrown), February 27, 2019: "I rewrote my intro snakemake tutorial to fit a 3 hour workshop format, and to run on @mybinderteam - check it out! https://t.co/OIhKIS0OKa comments welcome here on Twitter, or there on hackmd!"

GGG 298, Jan 2020 - Week 4 - snakemake for running workflows!
Titus' snakemake livestream jan 20 2021 https://hackmd.io/jXwbvOyQTqWqpuWwrpByHQ
Also from 2020 by Titus Brown: https://hackmd.io/SU2NB89JRu6fRPtSFizEEA?view

The Carpentries, which teaches foundational coding and data science skills to researchers worldwide, has an in-depth lesson, "Getting Started with Snakemake", that is agnostic to application. If you know any Python, it really shows how you can become a Snakemake superuser with just a little Python knowledge, as a Snakefile is a superset of Python.

Pat Schloss' video series that recently went through using Snakemake:

"I realized that I was working with old data that I had downloaded with the help of a Snakemake rule. So, how would I force Snakemake to rerun the rule and all of the other rules that depended on that file? Watch this Code Club to find out! https://t.co/PFE0eOwdr0" (Pat Schloss, @PatSchloss, October 17, 2022)

Understanding Snakemake by Vince Buffalo

score 3 · Answer 3 · 2022-10-25

2.0 years ago

Matthias Zepper 4.9k

I think the tutorial "Reproducible, scalable, and shareable analysis workflows with Nextflow" is a very good introductory lesson on writing pipelines with Nextflow. If you are already familiar with Python, you might also want to look for beginner tutorials for Snakemake, but in general Nextflow is probably the most useful workflow language to learn in the biosciences for the foreseeable future.

2

"in general Nextflow is probably the most useful workflow language to learn in the biosciences for the foreseeable future"

Do you have some references for the statement above? I have invested quite a bit in snakemake but I can be persuaded to switch. However, it seems to me that snakemake and nextflow are more or less similar in terms of capabilities and popularity (this is based on a rough look at google hits on biostars, stackoverflow, activity on github, etc).

2.0 years ago by dariober 15k

6

Just my two cents, based on the learnings from my unsuccessful attempts to found a start-up dedicated to genomic data analysis & visualization, as already briefly mentioned a few days ago.

My main audience with Nucleotidy was wet lab biologists and genetic counsellors, and I wanted to provide them with user-friendly yet flexible tools & mini-pipelines running entirely in the browser for generating biological meaning out of already aligned data, variant calls or count tables. Hence, there was no heavy lifting involved, and my eventual tech stack was built around software compiled to WebAssembly, WebGL for data visualization and the pythonic Dagster for the workflows (which already comes with a nice free and open-source GUI called Dagit).

However, regardless of where I pitched this concept, praising all the nifty technical details, I was met with a complete lack of interest. I approached investors thinking that the more bleeding edge and visionary my technology was, the better. I stand corrected. They told me unequivocally that they would not even attempt to assess the technical merits of my product, which would be futile anyway; all that mattered to them was how many active users there were and how big the existing community around this technology was. I understood that, from a business perspective, the originality of your product is much less important than your ability to build a community of users.

Frankly, I dislike Nextflow. Having no Java / Groovy background, I find a lot of things not very intuitive, and a few years back, when I first evaluated it, there were also not yet many conventions for how to build "good" workflows with it. DSL1 was a blank canvas, and one could build very elegant or hideous workflows with it. But thanks to some early adopters among academic institutions (my current employer, NGI in Stockholm, was among them) and a bunch of very committed people who started and fostered the nf-core community together, this changed. One can still go completely freestyle in Nextflow, but a lot of consideration went into the sensible standards imposed by the nf-core pipelines and their canonical ways of doing things. Furthermore, the community created tooling and hundreds of readily available, freely usable modules, and provided thousands of hours of free support. Just do a quick estimate of how much money e.g. a company needs to invest to create similar goodies around their tech stack while paying bioinformaticians a competitive salary... so yes, in my opinion it is predominantly the nf-core community that turned Nextflow into a mature workflow language.

Admittedly, I am now working in a heavily Nextflow-biased environment, so it might have totally escaped me that a similarly thriving community around Snakemake exists, but I don't think so. Is there anything nf-core-like for Snakemake?

If not, I think, Nextflow is pretty much unchallenged on its way to reach the tipping point and a critical mass of adopters - at least within the niche of biosciences and genomics in particular. Yet, there will always be alternatives to Nextflow, e.g. Reflow is more reasonable to learn if you want to work for Illumina/Grail specifically and also CWL/WDL are not dead yet. Nonetheless, any serious contender in the future will have to follow suit in terms of community building - technical merits or even one nifty killer feature alone will no longer suffice.

2.0 years ago by Matthias Zepper 4.9k

1

Anecdote instead of a reference: I am traditionally a Snakemake user, but I have been convinced to finally learn Nextflow based on the quality and ease of use of the nf-core pipelines. That said, you can use nf-core pipelines without actually learning Nextflow.

2.0 years ago by Dave Carlson ★ 1.9k

1

Academia and industry are heavily using Nextflow (check the job postings asking for Nextflow experience, for example), and Nextflow supports pretty much all the main technologies you may think of when it comes to writing/managing pipelines. It's very well integrated with most of the main services in the field: only Nvidia and Seqera Labs (the creators of Nextflow) had support for Google Cloud Batch the second it was publicly released, and similarly Illumina DRAGEN was publicly released with Nextflow support already in place.

Nextflow has not only the community but also numerous employees working full time on it and on tasks related to it. Most of the platforms to manage pipelines support Nextflow, though you have Nextflow Tower, which is the best place to run/manage/monitor Nextflow pipelines (for obvious reasons). Recently, more amazing features were released, such as Wave and FusionFS, creating a reasonable distance between Nextflow and any other pipeline orchestrator in the life sciences.

I could keep listing reasons, including nf-core as Dave mentioned, but I think this is enough. If you really need to do something serious in life sciences, I wouldn't think twice about what software I would use: Nextflow is my choice 😃

2.0 years ago by Marcel Ribeiro-Dantas ▴ 590

1

Going off on a slight tangent: in my experience, WDL is actually a significantly easier language in which to begin writing pipelines than Nextflow. It's more intuitive (but also more limited, see the SPEC sheet), and doesn't have the Groovy/Java syntax, channels, etc., which I find more complicated. You can deploy your WDL scripts rather easily with Cromwell on your local machine, for example.

2.0 years ago by bompipi95 ▴ 170

2

In to the individual of which you are listening to has humble opinion the individual of which you are listening to think referring to the specific and current subject matter of this WDL language that of which is present in this time just too the creation of extending sentences to make longer and more complicated sentences that include many words

2.0 years ago by Pierre Lindenbaum 164k

2

I think I got the joke! I am so proud of myself. Furthermore, I feel the need to add an explanation because on the internet, jokes/parodies don't always get through (and if I got it wrong, it would be further evidence of that :-) and would also be somewhat embarrassing).

I believe Pierre is writing his answer as if it were written as a WDL specification.

2.0 years ago by Istvan Albert 101k

0

WDL is just tooooo verbose for me :-P https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/germline-short-variant-discovery/gvcf-generation-per-sample/1.0.0/GOTC_PairedEndSingleSampleWf.wdl

        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        ref_dict = ref_dict,
        ref_alt = ref_alt,
        ref_bwt = ref_bwt,
        ref_amb = ref_amb,
        ref_ann = ref_ann,
        ref_pac = ref_pac,
        ref_sa = ref_sa,

2.0 years ago by Pierre Lindenbaum 164k

0

If we want to talk about simplicity, we may as well use just scripts or Makefiles :). On the other hand, if you want reproducibility, which is a must these days, or execution somewhere other than your local machine, you'll need something better. When it comes to these two issues in particular, it's difficult to beat Nextflow. I agree Nextflow isn't the best and easiest solution for toy problems, though I would insist it's still fine.

2.0 years ago by Marcel Ribeiro-Dantas ▴ 590

2

I believe the argument that Nextflow only scales up and not down is not born out of reality. I think the more complicated a Nextflow pipeline is, the less likely (exponentially so) it is that people understand it. For a beginner, nf-core is a black box where they have little chance of understanding what is inside it. Instead of clear-cut, step-by-step stages they get a convoluted dependency graph.

When I look at a Nextflow pipeline I have an extraordinarily hard time telling apart the steps and the order in which each takes place. For example, here is a simple question: does the RNA-Seq workflow below filter the count matrix to remove empty rows? If so, what is the rule?

https://nf-co.re/rnaseq/3.9/usage

How could one find out if it does? Where would they need to look? How would they get there? What would they click on? I spent over 15 minutes and could not answer that question; all I ended up doing was following one link after another, going into a maze of included files.

In contrast, I can point a student to a deseq.R script that I gave them that has the following:

# At least 3 samples with a count of 10 or higher
keep <- rowSums(counts(dds) >= 10) >= 3
dds  <- dds[keep,]

When I teach students step-by-step analysis, most if not all can change these values. I don't think they could change the Nextflow pipeline to do the same.
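For comparison, the same kind of filter can also be written as a small standalone shell step on a plain count table. This is only a sketch: it assumes a tab-separated file with a header line, gene names in the first column and one column of raw counts per sample, and the function name filter_counts is hypothetical. The two thresholds are arguments, so changing them is a one-word edit.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Keep the header plus every row in which at least MIN_SAMPLES
# sample columns hold a count of MIN_COUNT or higher.
# Reads a tab-separated table on stdin, writes the result to stdout.
# Usage: filter_counts MIN_COUNT MIN_SAMPLES < counts.tsv > filtered.tsv
filter_counts() {
    local min_count="$1" min_samples="$2"
    awk -F'\t' -v c="$min_count" -v s="$min_samples" '
        NR == 1 { print; next }              # pass the header through
        {
            n = 0
            for (i = 2; i <= NF; i++)        # column 1 is the gene name
                if ($i + 0 >= c) n++
            if (n >= s) print
        }'
}
```

With the thresholds from the R snippet above, the call would be `filter_counts 10 3 < counts.tsv > filtered.tsv`.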

How would one even rerun that step alone? Exploratory data analysis is the very essence of science - many times we don't know what we are looking for.

In my opinion, instead of over-automation, deseq and gene expression enrichment analyses should be run in RStudio while exploring the results.

I have come to believe that Nextflow is a tool for a company to run data analyses in a standardized way. It is not well suited for general science.

2.0 years ago by Istvan Albert 101k

3

"When I look at a nextflow pipeline I have an extraordinarily hard time telling apart the steps and the order that each takes place"

You're probably referring to old DSL1 pipelines, as in DSL2 the order is obvious: steps are invoked in the order in which they appear in the workflow block.

process A {
  ...
}
process B {
  ...
}

workflow {
  A()
  B()
}

"In contrast, I can point a student to a deseq.R script that I gave them that has the following"

As for the comparison between a fully versioned, automated, reproducible and multi-cloud-HPC executable pipeline and an R script, I'm not surprised it's easier to understand what's going on in the R script. It's a toy situation, used for learning, and very limited for proper real-life scenarios. Of course, everyone is free to do science with an R script and have it forgotten in some drawer, without any reproducibility, but I believe this will become less and less acceptable.

I want to make it clear that I don't disagree with you when you say we shouldn't throw pipeline orchestrators at students who barely know what programming or bioinformatics is, just like we shouldn't throw Git or command line tools at them. But if they want to do it well, and properly, with time these are all a must.

As for your last paragraph, I totally disagree with you. I believe Nextflow to be more useful to academia and science in general than to industry and large groups/companies, where there are more suitable tools (or at least layers of tools) such as Nextflow Tower, among others.

2.0 years ago by Marcel Ribeiro-Dantas ▴ 590

0

Do you feel an R-based workflow language would be a good idea?

24 months ago by Dunois ★ 2.8k

2

I think there are already too many workflow languages. I would rather see efforts to improve the ones we already have available, unless you have some revolutionary thing in mind.

24 months ago by Marcel Ribeiro-Dantas ▴ 590

1

Dunois

Temporal is known to support a variety of programming languages via dedicated SDKs, so one can write Temporal workflows in e.g. Go, Java, Python, PHP, TypeScript, .NET, Rust, Ruby, Clojure and Scala instead of a dedicated DSL. They also actively invite the community to develop new SDKs, so you could write one for R.

However, I have doubts about how big the added value really is. In science in particular, the disadvantage of having to learn another DSL is outweighed by the benefits you get from openly collaborating with others on workflows.

The time you save by reusing modules or sub-workflows during development, in my opinion, justifies agreeing on a specific language and also certain programming patterns, which may as well be a DSL but could of course also be e.g. Python.

In any case, one needs to set standards to enable efficient collaboration: suppose nf-core were a collection of Temporal workflows and the RNA-seq pipeline was written in PHP, Sarek in Clojure, AIRR-seq in Go, etc... I don't think there would be much collaboration possible.

24 months ago by Matthias Zepper 4.9k

1

You have a great point there, Matthias: Collaboration. Thanks for bringing this up!

As for the discussion on DSL, nf-core, and so on, I want to clarify here (for beginners who are new to nextflow and nf-core) that you can have any programming language in your nextflow pipeline. The Nextflow DSL has the sole purpose of describing your workflow and how it should behave, but the work is done in whatever language you want: Python, R, PHP, compiled tools, and so on. A lot of nextflow pipelines have tasks for a single pipeline being performed in many different programming languages plus many different compiled tools. We provide a language that is specifically tailored for building pipelines (Nextflow) and then you have all the other programming languages that are specifically tailored for other things.

This is obvious for people who know Nextflow and nf-core, like us, Matthias Zepper, but it is not always clear to people who are just starting, so I think it's important to emphasize that.

24 months ago by Marcel Ribeiro-Dantas ▴ 590

score 1 · Answer 4 · 2022-10-25

The best free online material about pipeline/workflow writing, in my opinion, is the Nextflow training built by Seqera Labs. Not only is it excellent material, maintained by the company of the creators of Nextflow, but it also provides a Gitpod environment to play with it for free, including examples with Docker, among other interesting tools. You don't have to install anything on your machine if you don't want to. You can check it out here. It's a bit long, but it covers a lot of the questions and use cases that people have at the beginning and even in intermediate-level situations. If you need some more advanced strategies, there is the Nextflow Pattern repository, which contains many common situations people run into when they're trying to do something more complicated.