I have a Nextflow workflow that runs FASTQC, STAR index, STAR align, and MULTIQC.
FASTQC produces two .zip and two .html files per sample (paired-end reads R1/R2). My FASTQC process emits the files with a pattern like *_fastqc.zip. When feeding all FASTQC .zip files (and STAR log files) to MULTIQC, I get this error:
Process `MULTIQC` input file name collision -- There are multiple input files for each of the following file names: *_fastqc.zip
Or there is an input file name collision for:
control1.zip, control2.zip, control3.zip, experiment1.zip, experiment2.zip, experiment3.zip
To get around this, I have tried to emit the .zip and .html files like this:
tuple val(sample), path("${sample}_*_fastqc.zip"), emit: zip
tuple val(sample), path("${sample}_*_fastqc.html"), emit: html
or:
tuple val(sample), path('*.zip'), emit: zip
tuple val(sample), path('*.html'), emit: html
But this results in MultiQC running multiple times rather than once. The goal is to generate a single MultiQC report from all FASTQC .zip files and all STAR align log files, passing MultiQC one input list of the form:
[sample1_R1_fastqc.zip, sample1_R2_fastqc.zip, sample1.Log.final.out,
sample2_R1_fastqc.zip, sample2_R2_fastqc.zip, sample2.Log.final.out, ...]
Current FASTQC process
process FASTQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'
    input:
    tuple val(sample), path(fastq)
    output:
    tuple val(sample), path('*.zip'), emit: zip
    tuple val(sample), path('*.html'), emit: html
    script:
    """
    fastqc $fastq -t $task.cpus
    """
    stub:
    """
    touch ${sample}_R1_fastqc.zip
    touch ${sample}_R1_fastqc.html
    touch ${sample}_R2_fastqc.zip
    touch ${sample}_R2_fastqc.html
    """
}
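One thing I noticed while writing this up, offered as an untested sketch rather than a confirmed fix: the workflow further down splits each read pair into single-read tuples, so FASTQC runs once per read file, yet the stub above creates both the R1 and R2 outputs in every task. Two tasks for the same sample would then emit identically named files, which may be where the collision in my dry runs comes from. A stub that names its outputs after the actual input file could avoid that. The sketch assumes one read file per task, named like sample1_R1.fastq.gz; `simpleName` is the Nextflow path attribute that returns the file name without its extensions.

```nextflow
// Sketch only: same FASTQC process, but the stub derives output names from the
// input file instead of hard-coding both R1 and R2.
// Assumption: each task receives ONE read file, e.g. sample1_R1.fastq.gz,
// so fastq.simpleName evaluates to "sample1_R1".
process FASTQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'

    input:
    tuple val(sample), path(fastq)

    output:
    tuple val(sample), path('*.zip'), emit: zip
    tuple val(sample), path('*.html'), emit: html

    script:
    """
    fastqc $fastq -t $task.cpus
    """

    stub:
    """
    touch ${fastq.simpleName}_fastqc.zip
    touch ${fastq.simpleName}_fastqc.html
    """
}
```

With this stub, each dry-run task would produce exactly one .zip and one .html, mirroring the names that the real `fastqc` command gives its outputs for a single input file.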
Current MultiQC process
process MULTIQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'
    input:
    path ('*')
    output:
    path('multiqc_report.html')
    script:
    """
    multiqc .
    """
    stub:
    """
    touch multiqc_report.html
    """
}
Relevant part of the workflow
Channel.fromFilePairs(params.reads)
    | flatMap { sample_id, reads ->
        reads.collect { read -> tuple(sample_id, read) }
    }
    | set { fastqc_channel }
FASTQC(fastqc_channel)
STAR(tuple(file(params.genome), file(params.gtf)))
STAR_ALIGN(STAR.out.index_dir, align_ch)
multiqc_ch = FASTQC.out.zip.map { it[1] }
    .mix(STAR_ALIGN.out.log.map { it[1] })
    .collect()
    .flatten()
multiqc_ch.view()
MULTIQC(multiqc_ch)
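As far as I understand the operators, `.collect()` gathers every item into a single list emission, and `.flatten()` then splits that list back into one emission per file, which would explain why MULTIQC is launched once per file. A minimal sketch of the same channel without `.flatten()`, based only on reasoning about the operators and not yet verified in a full run:

```nextflow
// Sketch: gather every FASTQC zip and STAR log into ONE list so that
// MULTIQC receives a single emission and therefore runs a single task.
multiqc_ch = FASTQC.out.zip.map { it[1] }        // drop the sample id, keep the path(s)
    .mix(STAR_ALIGN.out.log.map { it[1] })       // add the STAR log paths
    .collect()                                   // one emission: [zip, zip, ..., log, ...]

multiqc_ch.view()   // expected to print one list rather than one line per file
MULTIQC(multiqc_ch)
```

Because the `path` input of MULTIQC accepts a list of files, all of them would be staged into the same work directory for that one task.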
STAR align has the correct number of log files in the output. Output of STAR align takes the form:
tuple val(sample), path("*.log.final.out"), emit: log
How do I ensure that MultiQC runs only once, receiving a single list of the .zip and log files for all reads? Bear in mind that I have only been doing dry runs so far.