Our group recently had to take over the development and maintenance of a collaborator's computational workflow. I received an email from the PI this morning with an attachment of what had been done. It is, I kid you not, 900 lines of pure torture in a single bash script. From nested if-else and for loops, to bash scripts writing bash scripts, to more pipes than Mario can possibly handle, I wonder if the use of any workflow management tool is uncommon in our field.
I only started hanging out here a couple of weeks ago and I seldom see questions about the use of any workflow management system. I thought perhaps I could start one and hope the rest of the community would post code snippets to encourage others to start using one.
I have used Ruffus, Luigi/SciLuigi, Nextflow, and Make, but ultimately settled on Snakemake.
Here is a good collection of existing frameworks and libraries. Jeremy Leipzig, a veteran here and my ex-officemate, recently published this paper.
I'm going to start by including a very simple Snakemake workflow that turns RNA-Seq data from FASTQs into sorted BAMs using STAR with mostly default parameters. It'd really be great if the community would start posting some examples of your favorite frameworks.
"""
wf.py
A simple Snakemake's workflow to process RNA-Seq using STAR.
"""
import os
# define an ENCODE's library name and their associated accessions
# as an example for this workflow
samples = {
'ENCLB555AOP': {
'1': 'ENCFF000FXU',
'2': 'ENCFF000FYF'
}
}
localrules: run_STAR
shell.prefix("source ~/.bash_profile; set -euo pipefail;")
rule run_STAR:
"""
Run simple STAR workflow.
"""
input: expand('results/{samples}/{samples}.star.sorted.bam', samples=samples.keys())
rule download_sample:
"""
Download RNA-Seq data from ENCODE.
"""
output: fq = '{prefix}/{sample}_{read}.fq.gz'
run:
acc = samples[wildcards.sample][wildcards.read]
shell('wget -O {output.fq} https://www.encodeproject.org/files/{acc}/@@download/{acc}.fastq.gz')
rule download_ensembl_gtf:
"""
Download GRCh37 GTF from ensembl.
"""
output: gtf = 'references/ensembl.gtf'
params: url = 'ftp://ftp.ensembl.org/pub/grch37/update/gtf/homo_sapiens/Homo_sapiens.GRCh37.82.gtf.gz'
run:
shell("""
wget -O {output.gtf}.gz {params.url}
pigz -d {output.gtf}.gz
""")
rule download_ensembl_dna:
"""
Download genomic DNA sequences from ensembl.
"""
output: fa ='references/dna.fa'
params: url = 'ftp://ftp.ensembl.org/pub/grch37/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz'
run:
shell("""
wget -O {output.fa}.gz {params.url}
pigz -d {output.fa}.gz
""")
rule index_star:
"""
Index genomic DNA sequences using STAR.
"""
input: fa = rules.download_ensembl_dna.output.fa,
gtf = rules.download_ensembl_gtf.output.gtf
output: Genome = 'references/Genome'
threads: 8
run:
res_dir = os.path.dirname(output.Genome)
shell("""
STAR --runMode genomeGenerate \
--runThreadN {threads} \
--genomeFastaFiles {input.fa} \
--sjdbGTFfile {input.gtf} \
--genomeDir {res_dir}/
""")
rule align_star:
"""
Align sequencing reads using STAR.
"""
input: fq1 = '{prefix}_1.fq.gz',
fq2 = '{prefix}_2.fq.gz',
idx = rules.index_star.output.Genome
output: bam = '{prefix}.star.bam',
sj = '{prefix}.star.sj'
threads: 8
run:
res_dir = os.path.dirname(output.bam)
idx_dir = os.path.dirname(input.idx)
shell("""
STAR --runThreadN {threads} \
--runMode alignReads \
--readFilesCommand pigz -d -c \
--outSAMtype BAM Unsorted \
--genomeDir {idx_dir}/ \
--outFileNamePrefix {res_dir}/ \
--readFilesIn {input.fq1} {input.fq2}
mv {res_dir}/Aligned.out.bam {output.bam}
mv {res_dir}/SJ.out.tab {output.sj}
""")
rule sort_bam:
"""
Sort bam by coordinates using sambamba.
"""
input: bam = '{prefix}.bam'
output: bam = '{prefix}.sorted.bam'
params: mem = '35G'
threads: 8
run:
tmp_dir = os.path.dirname(output.bam)
shell('sambamba sort --tmpdir {tmp_dir} -t {threads} -m {params.mem} -o {output.bam} {input.bam}')
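For anyone new to Snakemake: the expand() call in run_STAR is nothing magical, it just builds the list of final target paths from the samples dict, and Snakemake then works backward through the wildcard rules to produce them. A rough plain-Python sketch of what that one call evaluates to:

```python
# The same samples dict as in the workflow above.
samples = {
    'ENCLB555AOP': {
        '1': 'ENCFF000FXU',
        '2': 'ENCFF000FYF'
    }
}

# Roughly what expand('results/{samples}/{samples}.star.sorted.bam',
# samples=samples.keys()) returns: one sorted-bam target per library.
targets = ['results/{s}/{s}.star.sorted.bam'.format(s=s) for s in samples]
print(targets)  # ['results/ENCLB555AOP/ENCLB555AOP.star.sorted.bam']
```

Adding a second library to the dict would add a second target, and the download/align/sort rules would pick it up with no other changes.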
Assuming you have a standalone machine with 32 cores:
snakemake -s wf.py run_STAR -j32
To submit the workflow to an SGE scheduler:
snakemake -s wf.py run_STAR -j999 -c "qsub -pe smp {threads}"
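Note that the cluster command passed to -c is just a template string: Snakemake fills in placeholders such as {threads} per job, much like Python's str.format. A tiny illustration of the substitution (the template is the one from the command above; the thread count is whatever the rule declares, 8 here):

```python
# Hypothetical sketch of how the -c cluster command template is expanded
# for one job of a rule that declares threads: 8.
cluster_cmd = "qsub -pe smp {threads}"
job_cmd = cluster_cmd.format(threads=8)
print(job_cmd)  # qsub -pe smp 8
```

This is why -j999 is safe here: it only caps the number of jobs submitted at once, while the per-job core request comes from each rule's threads directive.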
This example only uses some basic features of Snakemake. To learn more, visit here.
Another really good place to start:
https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems