Question

Nextflow MultiQC runs multiple times due to FASTQC zip name collisions

19 hours ago

DdogBoss ▴ 30

I have a Nextflow workflow that runs FASTQC, STAR index, STAR align, and MULTIQC.

FASTQC produces two .zip and two .html files per sample (paired-end reads R1/R2). My FASTQC process emits the files with a pattern like *_fastqc.zip. When feeding all FASTQC .zip files (and STAR log files) to MULTIQC, I get this error:

Process `MULTIQC` input file name collision -- There are multiple input files for each of the following file names: *_fastqc.zip

Or there is an input file name collision for:

control1.zip, control2.zip, control3.zip, experiment1.zip, experiment2.zip, experiment3.zip

To get around this, I have tried emitting the .zip and .html files like this:

tuple val(sample), path("${sample}_*_fastqc.zip"), emit: zip
tuple val(sample), path("${sample}_*_fastqc.html"), emit: html

or:

tuple val(sample), path('*.zip'), emit: zip
tuple val(sample), path('*.html'), emit: html

But this results in MultiQC running multiple times rather than once. The goal is to generate one MultiQC report for all log files and .zip files from FASTQC and a STAR align process when given an input tuple:

[sample1_R1_fastqc.zip, sample1_R2_fastqc.zip, sample1.Log.final.out,
sample2_R1_fastqc.zip, sample2_R2_fastqc.zip, sample2.Log.final.out, ...]

Current FASTQC process

process FASTQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'

    input:
    tuple val(sample), path(fastq)

    output:
    tuple val(sample), path('*.zip'), emit: zip
    tuple val(sample), path('*.html'), emit: html

    script:
    """
    fastqc $fastq -t $task.cpus
    """

    stub:
    """
    touch ${sample}_R1_fastqc.zip
    touch ${sample}_R1_fastqc.html
    touch ${sample}_R2_fastqc.zip
    touch ${sample}_R2_fastqc.html
    """
}

Current MultiQC process

process MULTIQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'

    input:
    path ('*')

    output:
    path('multiqc_report.html')

    script:
    """
    multiqc .
    """

    stub:
    """
    touch multiqc_report.html
    """
}

Relevant part of the workflow

Channel.fromFilePairs(params.reads)
| flatMap { sample_id, reads ->
    reads.collect { read -> tuple(sample_id, read) }
}
| set { fastqc_channel }

FASTQC(fastqc_channel)
STAR(tuple(file(params.genome), file(params.gtf)))
STAR_ALIGN(STAR.out.index_dir, align_ch)

multiqc_ch = FASTQC.out.zip.map { it[1] }
    .mix(STAR_ALIGN.out.log.map { it[1] })
    .collect()
    .flatten()

multiqc_ch.view()
MULTIQC(multiqc_ch)

STAR align has the correct number of log files in the output. Output of STAR align takes the form:

tuple val(sample), path("*.log.final.out"), emit: log

How do I ensure that MultiQC runs once, given a single list of the .zip and log files for all reads? Bear in mind that I have only been doing dry runs (stub-run mode) so far.

fastqc nextflow multiqc • 135 views

Updated 3 hours ago by Phil Ewels ★ 1.5k • written 19 hours ago by DdogBoss ▴ 30

score 2 · Accepted Answer · 2025-10-07

18 hours ago

DdogBoss ▴ 30

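Since I was running in stub-run mode, the collision came from the stub block: each FASTQC task receives a single read file, but the stub always created both the R1 and R2 files, so the two tasks for a sample produced identically named outputs. The solution was to give the stub outputs unique names, like this: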

touch ${sample}_${read_id}_fastqc.zip
touch ${sample}_${read_id}_fastqc.html
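
For reference, here is a minimal sketch of how the whole process might look with that change. The process above does not define `read_id`, so this version derives the name from the staged input file via `simpleName` instead; that substitution is an assumption, and it relies on each task receiving exactly one read file, as the `flatMap` in the workflow arranges. Note that `simpleName` strips everything after the first dot, which may not match FastQC's real output names for files with extra dots in them.

```nextflow
process FASTQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'

    input:
    // One read file per task (R1 and R2 arrive as separate tasks)
    tuple val(sample), path(fastq)

    output:
    tuple val(sample), path('*.zip'), emit: zip
    tuple val(sample), path('*.html'), emit: html

    script:
    """
    fastqc ${fastq} -t ${task.cpus}
    """

    stub:
    // Assumption: name the stub outputs after the input file so that each
    // task creates only its own pair of files, e.g. sample1_R1_fastqc.zip
    """
    touch ${fastq.simpleName}_fastqc.zip
    touch ${fastq.simpleName}_fastqc.html
    """
}
```

With this stub, the R1 and R2 tasks for a sample no longer emit files with the same names, so they can be staged together in the MULTIQC work directory.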


score 1 · Accepted Answer · 2025-10-08

There's an easier way than manually renaming the output files (though you can do this, and it works): you can use the name (aka stageAs) attribute of path with a pattern. I'm struggling to find the right docs at the moment, but from memory, ? is replaced by a number that increments for each input, ?? is the same but zero-padded, and * keeps the original file name.

See many of the nf-core pipelines for an example, e.g. here:

    input:
    path  multiqc_files, stageAs: "?/*"

This stages each set of MultiQC input files into its own numbered subdirectory, with the number incrementing for each input, which avoids filename collisions.
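
Applied to the process in the question, that could look roughly like the sketch below. The process body is copied from the question with only the input line changed. The workflow lines at the bottom are a fragment that assumes the FASTQC and STAR_ALIGN processes from the question. One further thing worth checking there: as far as I can tell, the `.flatten()` after `.collect()` undoes the collect and emits every file as a separate item, which would make MULTIQC run once per file, so this sketch leaves it out.

```nextflow
process MULTIQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'

    input:
    // Each input file is staged into its own numbered directory
    // (1/, 2/, 3/, ...) and keeps its original name inside it
    path multiqc_files, stageAs: "?/*"

    output:
    path 'multiqc_report.html'

    script:
    """
    multiqc .
    """

    stub:
    """
    touch multiqc_report.html
    """
}

// Workflow fragment (assumes the FASTQC and STAR_ALIGN outputs from the question)
multiqc_ch = FASTQC.out.zip.map { it[1] }
    .mix(STAR_ALIGN.out.log.map { it[1] })
    .collect()   // a single emission holding all files, so MULTIQC runs once

MULTIQC(multiqc_ch)
```

Because `multiqc .` searches subdirectories, the numbered staging directories should not need any extra handling on the MultiQC side.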

Hope this helps!