Nextflow MultiQC runs multiple times due to FASTQC zip name collisions
19 hours ago
DdogBoss ▴ 30

I have a Nextflow workflow that runs FASTQC, STAR index, STAR align, and MULTIQC.

FASTQC produces two .zip and two .html files per sample (paired-end reads R1/R2). My FASTQC process emits the files with a pattern like *_fastqc.zip. When feeding all FASTQC .zip files (and STAR log files) to MULTIQC, I get this error:

Process `MULTIQC` input file name collision -- There are multiple input files for each of the following file names: *_fastqc.zip

Or there is an input file name collision for:

control1.zip, control2.zip, control3.zip, experiment1.zip, experiment2.zip, experiment3.zip

To get around this, I tried emitting the .zip and .html files like this:

tuple val(sample), path("${sample}_*_fastqc.zip"), emit: zip
tuple val(sample), path("${sample}_*_fastqc.html"), emit: html

or:

tuple val(sample), path('*.zip'), emit: zip
tuple val(sample), path('*.html'), emit: html

But this results in MultiQC running multiple times rather than once. The goal is to generate a single MultiQC report from all the FASTQC .zip files and the STAR align log files, passed in together as one input list:

[sample1_R1_fastqc.zip, sample1_R2_fastqc.zip, sample1.Log.final.out,
sample2_R1_fastqc.zip, sample2_R2_fastqc.zip, sample2.Log.final.out, ...]

Current FASTQC process

process FASTQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'

    input:
    tuple val(sample), path(fastq)

    output:
    tuple val(sample), path('*.zip'), emit: zip
    tuple val(sample), path('*.html'), emit: html

    script:
    """
    fastqc $fastq -t $task.cpus
    """

    stub:
    """
    touch ${sample}_R1_fastqc.zip
    touch ${sample}_R1_fastqc.html
    touch ${sample}_R2_fastqc.zip
    touch ${sample}_R2_fastqc.html
    """
}

Current MultiQC process

process MULTIQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'

    input:
    path ('*')

    output:
    path('multiqc_report.html')

    script:
    """
    multiqc .
    """

    stub:
    """
    touch multiqc_report.html
    """
}

Relevant part of the workflow

Channel.fromFilePairs(params.reads)
| flatMap { sample_id, reads ->
    reads.collect { read -> tuple(sample_id, read) }
}
| set { fastqc_channel }

FASTQC(fastqc_channel)
STAR(tuple(file(params.genome), file(params.gtf)))
STAR_ALIGN(STAR.out.index_dir, align_ch)

multiqc_ch = FASTQC.out.zip.map { it[1] }
    .mix(STAR_ALIGN.out.log.map { it[1] })
    .collect()
    .flatten()

multiqc_ch.view()
MULTIQC(multiqc_ch)

STAR align produces the correct number of log files. Its output is declared as:

tuple val(sample), path("*.log.final.out"), emit: log

How do I ensure that MultiQC runs once when given a single tuple of the zip and log files for all reads? Bear in mind that I have been doing dry runs.

fastqc nextflow multiqc • 135 views
18 hours ago
DdogBoss ▴ 30

Since I was running in stub-run mode, the solution was to give the placeholder files created in the FASTQC stub block unique names, like this:

touch ${sample}_${read_id}_fastqc.zip
touch ${sample}_${read_id}_fastqc.html
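
For illustration, here is a minimal sketch of such a stub. It assumes each FASTQC task receives a single FASTQ file (as the flatMap in the question arranges) and derives the unique name from the input file itself, since no read_id variable is defined in the process as posted. The prefix variable is hypothetical. simpleName is the file name with its extensions removed, so an input called control1_R1.fastq.gz would give control1_R1.

```nextflow
process FASTQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'

    input:
    tuple val(sample), path(fastq)   // assumption: one FASTQ file per task

    output:
    tuple val(sample), path('*_fastqc.zip'), emit: zip
    tuple val(sample), path('*_fastqc.html'), emit: html

    script:
    """
    fastqc $fastq -t $task.cpus
    """

    stub:
    // Hypothetical helper: name the placeholders after the input read,
    // so R1 and R2 of the same sample never produce the same file name.
    def prefix = fastq.simpleName
    """
    touch ${prefix}_fastqc.zip
    touch ${prefix}_fastqc.html
    """
}
```

With this, the stub creates one .zip and one .html per task, which mirrors what the real fastqc command does for a single input file.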
3 hours ago
Phil Ewels ★ 1.5k

There's an easier way than manually renaming the output files (though you can do this, and it works): you can use the name (aka stageAs) attribute of path with a pattern. I'm struggling to find the right docs at the moment, but from memory ? is replaced by a number that increments for each input, ?? is the same but zero-padded, and * is the original file name.

Many of the nf-core pipelines do this, e.g.:

    input:
    path  multiqc_files, stageAs: "?/*"

This stages each set of MultiQC input files into its own numbered subdirectory, with the number incrementing for each input, which avoids filename collisions.
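
As a rough sketch, this is how the pattern might be applied to the MULTIQC process from the question. The input name multiqc_files comes from the nf-core snippet; the rest of the process is copied from the question. The channel fragment at the end assumes the FASTQC and STAR_ALIGN processes and outputs defined in the question, and it is meant to replace the corresponding lines there rather than run on its own.

```nextflow
process MULTIQC {
    publishDir params.outdir, mode: "copy", pattern: '*.html'
    label 'process_low'

    input:
    // Each input file is staged in its own numbered subdirectory
    // (1/, 2/, ...), so identical file names no longer collide.
    path multiqc_files, stageAs: "?/*"

    output:
    path 'multiqc_report.html'

    script:
    """
    multiqc .
    """

    stub:
    """
    touch multiqc_report.html
    """
}

// Fragment for the workflow in the question (FASTQC and STAR_ALIGN assumed).
multiqc_ch = FASTQC.out.zip.map { it[1] }
    .mix(STAR_ALIGN.out.log.map { it[1] })
    .collect()   // a single list holding every file, so MULTIQC runs once

MULTIQC(multiqc_ch)
```

Note that the fragment ends the chain at collect(). In the question's workflow, flatten() follows collect() and emits every file as a separate item again, and a process runs once for each item it receives, which would explain MultiQC running multiple times regardless of how the files are named.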

Hope this helps!
