Tutorial: Use a workflow management tool to manage your computational pipelines
Eric Lim wrote, 2.6 years ago:

Our group recently had to take over the development and maintenance of a collaborator's computational workflow. I received an email from the PI this morning with an attachment of what had been done. It is, I kid you not, 900 lines of pure torture in a single bash script. From nested if-else and for loops, to bash scripts writing other bash scripts, to more pipes than Mario can possibly handle, it makes me wonder whether the use of any workflow management tool is simply uncommon in our field.

I only started hanging out here a couple of weeks ago and I seldom see questions about the use of any workflow management system. I thought perhaps I could start one and hope the rest of the community would post code snippets to encourage others to start using one.

I have used Ruffus, Luigi/SciLuigi, NextFlow, and Make, but ultimately settled with Snakemake.

Here is a good collection of existing frameworks and libraries. Jeremy Leipzig, a veteran here and my ex-officemate, recently published this paper.

I'm going to start by including a very simple Snakemake workflow that turns RNA-Seq data from fastqs into sorted bams using STAR with mostly default parameters. It'd really be great if the community would start posting some examples of your favorite frameworks.

A simple Snakemake workflow to process RNA-Seq using STAR.
import os

# define an ENCODE's library name and their associated accessions
# as an example for this workflow
samples = {
    'ENCLB555AOP': {
        '1': 'ENCFF000FXU',
        '2': 'ENCFF000FYF',
    },
}

localrules: run_STAR
shell.prefix("source ~/.bash_profile; set -euo pipefail;")

rule run_STAR:
    '''Run simple STAR workflow.'''
    input: expand('results/{samples}/{samples}.star.sorted.bam', samples=samples.keys())

rule download_sample:
    '''Download RNA-Seq data from ENCODE.'''
    output: fq = '{prefix}/{sample}_{read}.fq.gz'
    run:
        acc = samples[wildcards.sample][wildcards.read]
        shell('wget -O {output.fq} https://www.encodeproject.org/files/{acc}/@@download/{acc}.fastq.gz')

rule download_ensembl_gtf:
    '''Download GRCh37 GTF from ensembl.'''
    output: gtf = 'references/ensembl.gtf'
    params: url = 'ftp://ftp.ensembl.org/pub/grch37/update/gtf/homo_sapiens/Homo_sapiens.GRCh37.82.gtf.gz'
    shell:
        '''
        wget -O {output.gtf}.gz {params.url}
        pigz -d {output.gtf}.gz
        '''

rule download_ensembl_dna:
    '''Download genomic DNA sequences from ensembl.'''
    output: fa = 'references/dna.fa'
    params: url = 'ftp://ftp.ensembl.org/pub/grch37/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz'
    shell:
        '''
        wget -O {output.fa}.gz {params.url}
        pigz -d {output.fa}.gz
        '''

rule index_star:
    '''Index genomic DNA sequences using STAR.'''
    input: fa = rules.download_ensembl_dna.output.fa,
           gtf = rules.download_ensembl_gtf.output.gtf
    output: Genome = 'references/Genome'
    threads: 8
    run:
        res_dir = os.path.dirname(output.Genome)
        shell('''
              STAR --runMode genomeGenerate        \
                   --runThreadN {threads}          \
                   --genomeFastaFiles {input.fa}   \
                   --sjdbGTFfile {input.gtf}       \
                   --genomeDir {res_dir}/
              ''')

rule align_star:
    '''Align sequencing reads using STAR.'''
    input: fq1 = '{prefix}_1.fq.gz',
           fq2 = '{prefix}_2.fq.gz',
           idx = rules.index_star.output.Genome
    output: bam = '{prefix}.star.bam',
            sj = '{prefix}.star.sj'
    threads: 8
    run:
        res_dir = os.path.dirname(output.bam)
        idx_dir = os.path.dirname(input.idx)
        shell('''
              STAR --runThreadN {threads}                 \
                   --runMode alignReads                   \
                   --readFilesCommand pigz -d -c          \
                   --outSAMtype BAM Unsorted              \
                   --genomeDir {idx_dir}/                 \
                   --outFileNamePrefix {res_dir}/         \
                   --readFilesIn {input.fq1} {input.fq2}
              mv {res_dir}/Aligned.out.bam {output.bam}
              mv {res_dir}/SJ.out.tab {output.sj}
              ''')

rule sort_bam:
    '''Sort bam by coordinates using sambamba.'''
    input: bam = '{prefix}.bam'
    output: bam = '{prefix}.sorted.bam'
    params: mem = '35G'
    threads: 8
    run:
        tmp_dir = os.path.dirname(output.bam)
        shell('sambamba sort --tmpdir {tmp_dir} -t {threads} -m {params.mem} -o {output.bam} {input.bam}')
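One thing worth noting about this workflow is how the `{prefix}` wildcard chains the rules together: Snakemake matches the requested sorted bam against `sort_bam`, whose input matches `align_star`, whose fastq inputs in turn match `download_sample`. A rough sketch of that resolution in plain Python (the regex translation is only an approximation of Snakemake's wildcard engine, which also defaults wildcards to `.+`):

```python
import re

def match(pattern, path):
    """Approximate Snakemake wildcard matching: turn '{name}' into a named regex group."""
    regex = re.sub(r'\{(\w+)\}', r'(?P<\1>.+)', pattern)
    m = re.fullmatch(regex, path)
    return m.groupdict() if m else None

# final target requested by run_STAR
target = 'results/ENCLB555AOP/ENCLB555AOP.star.sorted.bam'

# sort_bam: output '{prefix}.sorted.bam' -> input '{prefix}.bam'
wc = match('{prefix}.sorted.bam', target)
bam = wc['prefix'] + '.bam'      # needs align_star

# align_star: output '{prefix}.star.bam' -> inputs '{prefix}_1.fq.gz', '{prefix}_2.fq.gz'
wc = match('{prefix}.star.bam', bam)
fq1 = wc['prefix'] + '_1.fq.gz'  # needs download_sample

# download_sample: output '{prefix}/{sample}_{read}.fq.gz' recovers the dict keys
wc = match('{prefix}/{sample}_{read}.fq.gz', fq1)
print(wc['sample'], wc['read'])
```

The recovered `sample` and `read` wildcards are exactly the keys used to look up the ENCODE accession in the `samples` dict.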

Assuming you have a standalone machine with 32 cores:

snakemake -s wf.py run_STAR -j32

To submit the workflow to an SGE scheduler:

snakemake -s wf.py run_STAR -j999 -c "qsub -pe smp {threads}"

This example only uses some basic features of Snakemake. To learn more, visit the official Snakemake documentation.
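Scaling this up to more libraries only requires extending the `samples` dict, since the target list in `run_STAR` is built from its keys. A small sketch in plain Python of what the `expand()` call resolves to (the second library name and its accessions are hypothetical placeholders, not real ENCODE identifiers):

```python
# the dict from the workflow above, extended with a second library
# (ENCLB555XXX / ENCFF000AAA / ENCFF000BBB are made-up placeholders)
samples = {
    'ENCLB555AOP': {'1': 'ENCFF000FXU', '2': 'ENCFF000FYF'},
    'ENCLB555XXX': {'1': 'ENCFF000AAA', '2': 'ENCFF000BBB'},
}

# what expand('results/{samples}/{samples}.star.sorted.bam', samples=samples.keys())
# produces as the input of run_STAR
targets = ['results/{0}/{0}.star.sorted.bam'.format(name) for name in samples]
print(targets)
```

With `-j32`, Snakemake will then download and align both libraries in parallel, as far as the core budget allows.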

modified 8 months ago by alaincoletta • written 2.6 years ago by Eric Lim

Another really good place to start:


written 2.6 years ago by Christian
alaincoletta wrote, 8 months ago:

For Nextflow, I always recommend that anyone getting started begin with implementation patterns. There is a really nice repository with plenty of examples:


And maybe follow up with one of the workflows:


This post shows an example of running a pipeline to process RNA-Seq data (download, align, count reads) for more than 40 samples (70 GB) in around 2 hours.




Powered by Biostar version 2.3.0