nextflow DSL2: best practice to design and reuse a process/workflow
2.4 years ago

Let's say I want to genotype a set of BAMs using GATK. A basic DSL2 nextflow workflow would look like:

workflow GENOTYPE {
    take:
        reference
        beds
        bams
    main:
        hc = haplotypecaller(reference, bams.combine(beds))
        bed2vcf = combinegvcf(hc.groupTuple())
        vcf = gathervcfs(bed2vcf.collect())
}

process haplotypecaller {
input:
    val(reference)
    tuple path(bam), path(bed)
output:
    tuple val(bed), path("sample.g.vcf.gz")
script:
    """
    gatk HaplotypeCaller -R ${reference} -I ${bam} -L ${bed} -ERC GVCF -O sample.g.vcf.gz
    """
}

process combinegvcf {
input:
    tuple val(bed), path(gvcfs)
output:
    path("combined.g.vcf.gz")
script:
"""
(...)
"""
}

process gathervcfs {
input:
    path(vcfs)
output:
    path("final.vcf.gz")
script:
"""
(...)
"""
}

But then I'm asked to run this workflow for a set of BAMs that come from different mappers (bwa, bowtie). Hmm... OK, easy: I can add the mapper to the input tuple

    tuple val(bam),val(bed),val(mapper)

and use a composite key when grouping with operators like groupTuple().
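For example (a minimal sketch; the exact tuple layout is illustrative):

// emit the mapper next to the bed so both can act as the grouping key
output:
    tuple val(bed), val(mapper), path("sample.g.vcf.gz")

// then, in the workflow, group on the first two elements at once
bed2vcf = combinegvcf( hc.groupTuple(by: [0, 1]) )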

But then I'm asked to run this new workflow with a ploidy parameter that changes with the sex (female/male) and the bed (PAR/X/Y/autosome).

But then I'm asked to run this new new workflow with various values for --min-mapping-quality. But then... etc., etc.

So my question is: what is the best practice for designing and reusing a process/sub-workflow? My feeling is to use an associative map to store the parameters, but then how do I handle this map in the output, and how can I reuse things after groupTuple()?
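To make this concrete, here is a minimal sketch of what I have in mind, where meta is a hypothetical map such as [sample:'S1', mapper:'bwa', sex:'female', ploidy:2, bed:'chr1'], and params.reference and all key names are made up:

process haplotypecaller {
input:
    // one map per sample carries all the per-sample parameters
    tuple val(meta), path(bam), path(bed)
output:
    // pass the whole map through so nothing is lost downstream
    tuple val(meta), path("${meta.sample}.g.vcf.gz")
script:
    """
    gatk HaplotypeCaller -R ${params.reference} -I ${bam} -L ${bed} \\
        --sample-ploidy ${meta.ploidy} -ERC GVCF -O ${meta.sample}.g.vcf.gz
    """
}

// after the process, pull the grouping key out of the map:
hc.map { meta, gvcf -> tuple( meta.bed, meta, gvcf ) }
  .groupTuple()   // emits ( bed, [meta, meta, ...], [gvcf, gvcf, ...] )

Since groupTuple() collects every non-key element into a list, the maps stay paired with their gvcfs after grouping.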

dsl2 nextflow workflow
2.4 years ago
lucacozzuto

Dear Pierre,

I had the same problem with my pipelines, so I decided to build my modules and sub-workflows in this way:


params.EXTRAPARS = ""

process MAP {

    input:
    tuple val(pair_id), path(reads)
    path(indexes)

    output:
    tuple val(pair_id), path("${pair_id}.bam") 

    script:
    def indexname = indexes[0].baseName

    """    
    bwa mem -t ${task.cpus} ${params.EXTRAPARS} ${indexname} ${reads} | samtools view -@ ${task.cpus} -Sb > ${pair_id}.bam
    """
}

So when I include them in the main.nf file, I write:

include { MAP } from "${moduleFolder}/alignment/bwa" addParams(EXTRAPARS: "--my favorite params --wtc --ciao")
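A nice side effect is that the same module can be included several times under different names with different parameters; a sketch with made-up aliases and bwa options:

include { MAP as MAP_DEFAULT } from "${moduleFolder}/alignment/bwa" addParams(EXTRAPARS: "")
include { MAP as MAP_SENSITIVE } from "${moduleFolder}/alignment/bwa" addParams(EXTRAPARS: "-k 11 -B 2")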

I also wrote some code for reading a tab-separated file with custom command-line options for each tool, so that you can change them without touching the pipeline code. You might have a look at this pipeline:

https://github.com/biocorecrg/MOP2/blob/main/mop_preprocess/mop_preprocess.nf
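Roughly, the idea is something like this (a sketch, not the actual code from that repository; params.tooloptions is a made-up name):

// each line of the file is "tool<TAB>extra options", e.g. "bwa\t-k 19"
def extrapars = [:]
file(params.tooloptions).readLines().each { line ->
    def fields = line.split('\t')
    extrapars[fields[0]] = fields.size() > 1 ? fields[1] : ''
}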

while I'm collecting the modules / sub-workflows here:

https://github.com/biocorecrg/BioNextflow

I usually add them to my pipelines via git submodules.

Best,

Luca

2.4 years ago
ATpoint

...not that I have a good answer, but I think the Map idea is the way to go. It is pretty much what the nf-core meta map that accompanies every module does: it carries information such as strandedness or paired/single-end status, so the process can read these values and run the job accordingly. I would check how they use the saveAs parameter of the publishDir directive for some inspiration, e.g. in this samtools module.
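A minimal sketch of that pattern (illustrative names, not the actual nf-core module; params.outdir is assumed):

process SAMTOOLS_SORT {
    // the meta map travels with the reads and can even drive directives
    publishDir params.outdir, mode: 'copy',
        saveAs: { filename -> "${meta.id}/${filename}" }

    input:
    tuple val(meta), path(bam)

    output:
    tuple val(meta), path("*.sorted.bam")

    script:
    """
    samtools sort -@ ${task.cpus} -o ${meta.id}.sorted.bam ${bam}
    """
}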

2.4 years ago
Sam

I have encountered a roughly similar problem, and I like @ATpoint's idea of using a map. For my work I often need to combine different processes based on different values, so in the end I wrote a little helper module here.
