nextflow DSL2: best practice to design and reuse a process/workflow
2.8 years ago

Let's say I want to genotype a set of BAMs using GATK. A basic DSL2 nextflow workflow would look like:

workflow genotype {
    take:
        reference
        beds
        bams
    main:
        hc = haplotypecaller(reference, bams.combine(beds))
        bed2vcf = combinegvcf(hc.groupTuple())
        vcf = gathervcfs(bed2vcf.collect())
}

process haplotypecaller {
    input:
        val(reference)
        tuple path(bam), path(bed)
    output:
        tuple val(bed), path("sample.g.vcf.gz")
    script:
    """
    gatk HaplotypeCaller -R ${reference} -I ${bam} -L ${bed} -ERC GVCF -O sample.g.vcf.gz
    """
}

process combinegvcf {
    input:
        tuple val(bed), val(gvcfs)
    output:
        path("combined.vcf.gz")
    script:
    """
    (...)
    """
}

process gathervcfs {
    input:
        val(vcfs)
    output:
        path("final.vcf.gz")
    script:
    """
    (...)
    """
}

But then I'm asked to run this workflow on a set of BAMs that come from different mappers (bwa, bowtie). Hmm... OK, easy: I can add a mapper to the input tuple

    tuple val(bam),val(bed),val(mapper)

and use a composite key when using operators like groupTuple()
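
For example (an untested sketch, assuming hc now emits tuple(bed, mapper, gvcf)):

// group on the composite (bed, mapper) key by index...
hc.groupTuple(by: [0, 1])

// ...or pack both values into a single list key first
hc.map { bed, mapper, gvcf -> tuple( [bed, mapper], gvcf ) }
  .groupTuple()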

But then I'm asked to run this new workflow with a ploidy parameter that changes with the sex (female/male) and the bed (PAR/X/Y/autosome).

But then I'm asked to run this new, new workflow with various values for --min-mapping-quality, and then... etc., etc.

So my question is: what is the best practice to design and reuse a process/sub-workflow? My feeling is to use an associative map to store the parameters, but then how do I handle this map in the output? How can I reuse things after groupTuple()?
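
Something like this rough, untested sketch is what I have in mind (the meta field names and the ploidy rule are made up):

// carry an associative "meta" map through the process...
process haplotypecaller {
    input:
        tuple val(meta), path(bam), path(bed)
    output:
        tuple val(meta), path("${meta.sample}.g.vcf.gz")
    script:
        // the ploidy is decided from the meta map (sex + interval type)
        def ploidy = (meta.sex == 'male' && meta.interval in ['X', 'Y']) ? 1 : 2
        """
        gatk HaplotypeCaller -R ${params.reference} -I ${bam} -L ${bed} \\
            -ERC GVCF --sample-ploidy ${ploidy} -O ${meta.sample}.g.vcf.gz
        """
}

// ...and regroup downstream on a subset of its keys
haplotypecaller.out
    .map { meta, gvcf -> tuple( meta.subMap(['interval', 'mapper']), gvcf ) }
    .groupTuple()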

dsl2 nextflow workflow

2.8 years ago
lucacozzuto

Dear Pierre,

I had the same problem with my pipelines, so I decided to build my modules / sub-workflows in this way:


params.EXTRAPARS = ""

process map {

    input:
    tuple val(pair_id), path(reads)
    path(indexes)

    output:
    tuple val(pair_id), path("${pair_id}.bam") 

    script:
    def indexname = indexes[0].baseName

    """    
    bwa mem -t ${task.cpus} ${params.EXTRAPARS} ${indexname} ${reads} | samtools view -@ ${task.cpus} -Sb > ${pair_id}.bam
    """
}

So when I include them within the main.nf file I write:

include { map as MAP } from "${moduleFolder}/alignment/bwa" addParams(EXTRAPARS: "--my favorite params --wtc --ciao")

I also made some code for reading a tab-separated file with custom command-line options for each tool, so that you can change them without touching the pipeline code (a simplified sketch is at the end of this answer). You might have a look at this pipeline:

https://github.com/biocorecrg/MOP2/blob/main/mop_preprocess/mop_preprocess.nf

while I'm collecting the modules / subworkflows here:

https://github.com/biocorecrg/BioNextflow

I usually add them to my pipelines via git submodules.
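
To illustrate the tab-separated options file, here is a simplified, untested sketch (the params.toolOpts file name is made up):

// read per-tool command-line options from a two-column TSV: <tool> TAB <options>
def extraPars = [:]
file(params.toolOpts).readLines().each { line ->
    def (tool, opts) = line.split('\t', 2)   // keep everything after the first tab together
    extraPars[tool] = opts ?: ''
}

include { map as MAP } from "${moduleFolder}/alignment/bwa" addParams(EXTRAPARS: extraPars.get('bwa', ''))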

Best,

Luca

2.8 years ago
ATpoint

...not that I have a good answer, but I think the Map idea is the way to go. It is pretty much what the nf-core meta map that accompanies every module is doing: it carries information such as strandedness or paired/single-end status, so the process can read it and run the job accordingly. I would also check how they use the saveAs parameter in the publishDir directive for some inspiration, e.g. as in this samtools module.
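
A rough, untested sketch of that pattern (the samplesheet columns, my_aligner, and the output layout are all illustrative):

// build a meta map per sample from a samplesheet
Channel.fromPath(params.samplesheet)
    .splitCsv(header: true)
    .map { row ->
        def meta = [ id: row.sample, single_end: !row.fastq_2, mapper: row.mapper ]
        tuple( meta, file(row.fastq_1) )
    }
    .set { reads_ch }

process EXAMPLE {
    // route published files per meta field, in the spirit of the nf-core saveAs usage
    publishDir params.outdir, mode: 'copy',
        saveAs: { fn -> "${meta.mapper}/${fn}" }

    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("${meta.id}.bam")

    script:
    // the process reads the meta map and adapts its command line
    def endedness = meta.single_end ? '--single' : '--paired'
    """
    my_aligner ${endedness} ${reads} > ${meta.id}.bam
    """
}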

2.8 years ago
Sam

I have encountered a roughly similar problem, and I like @ATpoint's idea of using a map. For my work I quite often need to combine different processes based on different values, so in the end I wrote up a little helper module here.
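
As a generic, untested illustration of routing items to different processes based on a value (the channel layout and process names are made up), Nextflow's built-in branch operator can be used like this:

// send each (meta, bam) pair down a mapper-specific path
bam_ch
    .branch { meta, bam ->
        bwa:    meta.mapper == 'bwa'
        bowtie: meta.mapper == 'bowtie'
    }
    .set { by_mapper }

POSTPROCESS_BWA( by_mapper.bwa )          // hypothetical downstream processes
POSTPROCESS_BOWTIE( by_mapper.bowtie )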
