Nextflow MultiQC runs multiple times due to FASTQC zip name collisions
4 hours ago
DdogBoss ▴ 20

I have a Nextflow workflow that runs FASTQC, STAR index, STAR align, and MULTIQC.

FASTQC produces two .zip and two .html files per sample (paired-end reads R1/R2). My FASTQC process emits the files with a pattern like *_fastqc.zip. When feeding all FASTQC .zip files (and STAR log files) to MULTIQC, I get this error:

Process `MULTIQC` input file name collision -- There are multiple input files for each of the following file names: *_fastqc.zip

Or there is an input file name collision for:

control1.zip, control2.zip, control3.zip, experiment1.zip, experiment2.zip, experiment3.zip

To get around this, I tried emitting the .zip and .html files like this:

tuple val(sample), path("${sample}_*_fastqc.zip"), emit: zip
tuple val(sample), path("${sample}_*_fastqc.html"), emit: html

or:

tuple val(sample), path('*.zip'), emit: zip
tuple val(sample), path('*.html'), emit: html

But this results in MultiQC running multiple times rather than once. The goal is to generate a single MultiQC report from all the FASTQC .zip files and the STAR align log files, with MULTIQC receiving one input list like:

[sample1_R1_fastqc.zip, sample1_R2_fastqc.zip, sample1.Log.final.out,
sample2_R1_fastqc.zip, sample2_R2_fastqc.zip, sample2.Log.final.out, ...]

Current FASTQC process

process FASTQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'

    input:
    tuple val(sample), path(fastq)

    output:
    tuple val(sample), path('*.zip'), emit: zip
    tuple val(sample), path('*.html'), emit: html

    script:
    """
    fastqc $fastq -t $task.cpus
    """

    stub:
    """
    touch ${sample}_R1_fastqc.zip
    touch ${sample}_R1_fastqc.html
    touch ${sample}_R2_fastqc.zip
    touch ${sample}_R2_fastqc.html
    """
}

Current MultiQC process

process MULTIQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'

    input:
    path ('*')

    output:
    path('multiqc_report.html')

    script:
    """
    multiqc .
    """

    stub:
    """
    touch multiqc_report.html
    """
}

Relevant part of the workflow

Channel.fromFilePairs(params.reads)
    | flatMap { sample_id, reads ->
        reads.collect { read -> tuple(sample_id, read) }
    }
    | set { fastqc_channel }

FASTQC(fastqc_channel)
STAR(tuple(file(params.genome), file(params.gtf)))
STAR_ALIGN(STAR.out.index_dir, align_ch)

multiqc_ch = FASTQC.out.zip.map { it[1] }
    .mix(STAR_ALIGN.out.log.map { it[1] })
    .collect()
    .flatten()

multiqc_ch.view()
MULTIQC(multiqc_ch)

STAR align produces the correct number of log files. Its output declaration is:

tuple val(sample), path("*.log.final.out"), emit: log

How do I ensure that MultiQC runs only once, receiving a single list of the zip and log files for all reads? Bear in mind that I have only been doing dry runs (stub-run mode).

fastqc nextflow multiqc • 59 views
3 hours ago
DdogBoss ▴ 20

Since I was running in stub-run mode, the solution was to give the files created in the FASTQC stub block unique names, like this:

touch ${sample}_${read_id}_fastqc.zip
touch ${sample}_${read_id}_fastqc.html
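
For anyone landing here with the same symptoms, two separate things seem to be going on in the question's code. First, the flatMap sends one read file to each FASTQC task, but the stub touches both the R1 and R2 names on every task, so each sample's files are created twice and collide when they are gathered. Second, .collect() gathers everything into one list, but the .flatten() after it emits the files one by one again, and a process runs once per item it receives. Below is a minimal sketch of both fixes, reusing the process and parameter names from the question. It is not a tested pipeline: the STAR steps are left as a comment, qc_files is a made-up variable name, and deriving the stub name from fastq.simpleName assumes the read file names already contain the R1/R2 part.

```nextflow
// Sketch only, under the assumptions stated above.
process FASTQC {
    input:
    tuple val(sample), path(fastq)   // one read file per task, as in the question

    output:
    tuple val(sample), path('*_fastqc.zip'), emit: zip
    tuple val(sample), path('*_fastqc.html'), emit: html

    script:
    """
    fastqc $fastq -t $task.cpus
    """

    stub:
    // Name the stub outputs after the single input file, so every task
    // creates exactly one uniquely named .zip/.html pair.
    """
    touch ${fastq.simpleName}_fastqc.zip
    touch ${fastq.simpleName}_fastqc.html
    """
}

process MULTIQC {
    input:
    path qc_files   // hypothetical name; a named path variable keeps the original file names

    output:
    path('multiqc_report.html')

    script:
    """
    multiqc .
    """

    stub:
    """
    touch multiqc_report.html
    """
}

workflow {
    Channel.fromFilePairs(params.reads)
        | flatMap { sample_id, reads -> reads.collect { read -> tuple(sample_id, read) } }
        | set { fastqc_channel }

    FASTQC(fastqc_channel)

    // STAR and STAR_ALIGN omitted here; mix in their logs as in the question:
    //   .mix(STAR_ALIGN.out.log.map { it[1] })
    multiqc_ch = FASTQC.out.zip
        .map { it[1] }   // drop the sample id, keep the file
        .collect()       // one list holding every file; no .flatten() afterwards

    MULTIQC(multiqc_ch)  // receives a single item, so it runs once
}
```

Running this with -stub-run should show MULTIQC executing a single task. If a process ever has to take several files that share a name, staging them into separate subdirectories is the usual way around the collision, but that should not be needed once the stub names are unique.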