Facing issue with output of nextflow pipeline
3 months ago
harsh ▴ 20

I created a Nextflow pipeline to run RNA-seq preprocessing.

This is the error I am facing. Can anyone please help me resolve it?

ERROR ~ Error executing process > 'FastQC (1)'

Caused by:

      Missing output file(s) `*` expected by process `FastQC (1)` (note: input files are not included in the default matching set)

Command executed:

  mkdir -p /home/PDX_Data/data/output/fastqc
  fastqc --threads 12 -o /home/PDX_Data/data/output/fastqc ERR1084768_1.fastq.gz ERR1084768_2.fastq.gz 2> /home/PDX_Data/data/output/fastqc/error.log

Command exit status:
  0

Command output:
  application/gzip
  application/gzip
  Analysis complete for ERR1084768_1.fastq.gz
  Analysis complete for ERR1084768_2.fastq.gz

Work dir:
  /home/PDX_Data/data/work/a3/0ad1935f61de5cc612bae18d56a242
output nextflow fastqc issue • 1.4k views
nextflow.enable.dsl=2

// Define parameters directly
params.reads     = '/home/PDX_Data/data/*_{1,2}.fastq.gz'
params.adapters  = '/home/miniconda3/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa'
params.index     = '/home/ref/grch38/genome'
params.gtf       = '/home/ref/Homo_sapiens.GRCh38.113.gtf'
params.output    = '/home/PDX_Data/data/output'
params.threads   = 12

// Ensure output directories exist
process SetupDirectories {
    output:
    path params.output

    script:
    """
    mkdir -p ${params.output}/{fastqc,trimmed,hisat2,bam,counts}
    """
}

// Quality control process
process FastQC {
    input:
    tuple val(sample_id), path(reads)

    output:
    path "*"

    script:
    """
    mkdir -p ${params.output}/fastqc
    fastqc --threads ${params.threads} -o ${params.output}/fastqc ${reads} 2> ${params.output}/fastqc/error.log
    """
}

// Read trimming
process Trimmomatic {
    input:
    path reads

    output:
    path "*_paired.fq.gz"

    script:
    """
    trimmomatic PE -threads ${params.threads} \
        ${reads[0]} ${reads[1]} \
        ${params.output}/trimmed/paired_1.fq.gz ${params.output}/trimmed/unpaired_1.fq.gz \
        ${params.output}/trimmed/paired_2.fq.gz ${params.output}/trimmed/unpaired_2.fq.gz \
        ILLUMINACLIP:${params.adapters}:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36
    """
}

// Alignment with HISAT2
process HISAT2 {
    input:
    path trimmed_reads

    output:
    path "*.sam"

    script:
    """
    hisat2 -p ${params.threads} -x ${params.index} -1 ${trimmed_reads[0]} -2 ${trimmed_reads[1]} -S ${params.output}/hisat2/output.sam
    """
}

// Convert and sort SAM to BAM
process SamtoolsSort {
    input:
    path sam_files

    output:
    path "*.bam"

    script:
    """
    samtools view -@ ${params.threads} -bS ${sam_files} | samtools sort -@ ${params.threads} -o ${params.output}/bam/output.sorted.bam
    """
}

// Feature counting
process FeatureCounts {
    input:
    path sorted_bam

    output:
    path "featureCounts.txt"

    script:
    """
    featureCounts -T ${params.threads} -a ${params.gtf} -o ${params.output}/counts/featureCounts.txt -p -B -C ${sorted_bam}
    """
}

workflow {
    SetupDirectories 
    reads_ch = Channel.fromFilePairs(params.reads, suffix: '_1.fastq.gz')
    reads_ch.view()
    reads_ch | FastQC | Trimmomatic | HISAT2 | SamtoolsSort | FeatureCounts
}
3 months ago

fastqc is not creating the output files that your output pattern expects (hence the "missing output file(s)" error).

Debugging

  • check the fastqc work directory to see what files are being created
  • try setting the expected output pattern to *.html
(base) user@user-ProLiant-DL380-Gen9:~/PDX_Data/data/output/fastqc$ ls

ERR1084768_1_fastqc.html  ERR1084768_1_fastqc.zip  ERR1084768_2_fastqc.html  ERR1084768_2_fastqc.zip  error.log

The results are created, but Nextflow is not able to find them. I think Nextflow is looking for them in the work directory, while the results are in the fastqc subdirectory under the output directory. But I don't know how to resolve this issue.


Please try running tree -h work (assuming your Nextflow work directory is called work).


Oh, best practice is not this:

fastqc --threads ${params.threads} -o ${params.output}/fastqc ${reads} 2> ${params.output}/fastqc/error.log

but this:

mkdir -p fastqc
fastqc --threads ${params.threads} -o fastqc ${reads}

i.e. don't try to tell Nextflow where to create data. It manages data entirely within the work directories. When each process completes, it writes the declared outputs to the publish directory, if one is set. If you move or write data straight to your output directory, Nextflow will not be able to find that data to use as input for the next step.
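
Putting that advice together, a minimal sketch of the FastQC step might look like the following. It reuses the params from the question; the publishDir target, the cpus directive, and the output glob are illustrative assumptions rather than anything confirmed in this thread. FastQC writes into the task's own work directory (-o .), the output declaration captures the reports there, and publishDir copies them to the results folder once the task has finished.

```nextflow
nextflow.enable.dsl = 2

params.reads   = '/home/PDX_Data/data/*_{1,2}.fastq.gz'
params.output  = '/home/PDX_Data/data/output'
params.threads = 12

process FastQC {
    // Assumption: publish the declared outputs under <output>/fastqc after the task completes
    publishDir "${params.output}/fastqc", mode: 'copy'
    cpus params.threads

    input:
    tuple val(sample_id), path(reads)

    output:
    // Matches only the reports written into the task work directory
    path "*_fastqc.{html,zip}"

    script:
    """
    fastqc --threads ${task.cpus} -o . ${reads}
    """
}

workflow {
    // Emits tuples of (sample_id, [read_1, read_2])
    reads_ch = Channel.fromFilePairs(params.reads)
    FastQC(reads_ch)
}
```

Because nothing is written outside the work directory, the output pattern has something to match and the "Missing output file(s)" error should no longer occur.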

12 weeks ago
mmhryc • 0

With Nextflow you don't want to manage paths manually. Instead of writing to a hardcoded path, let each process write its results into whatever work directory Nextflow assigns, and capture them with an appropriate output declaration. Downstream processes should receive those outputs through <process_name>.out or through the declared emit name. If you want the final results in a more convenient location, then you should specify it using publishDir, preferably with the 'link' or 'symlink' mode (so you don't copy large files).

Here's an example. I define a FlyeTest process that assembles CLR reads using Flye. Flye writes the assembly to <dir>/assembly.fasta, where <dir> is set with the -o option; in my case that is asm_out/assembly.fasta, and I tell Nextflow to take this file as the output. publishDir will create a hard link to the FlyeTest output and place it in the results directory.

A plain fromPath channel would launch a separate FlyeTest task for each input file; with .collect() I can pass them all as one list and .join(' ') them into a space-separated string of paths.

I pass the value for the reads_path parameter as 'data/Cell-?/seq-??.fastq.gz'. Each ? matches any single character, so with numeric names I can have up to 10 directories within data, Cell-0 to Cell-9, each with up to 100 fastq files from seq-00 to seq-99.

    process FlyeTest {

        // Hard-link the declared output into ./results
        publishDir 'results', mode: 'link'

        input:
        path read_list

        output:
        path 'asm_out/assembly.fasta'

        script:
        """
        flye --pacbio-raw ${read_list.join(' ')} -o asm_out --threads ${task.cpus}
        """
    }

    params.reads_path = './'
    workflow {
        // Gather every matching file into a single list so FlyeTest runs once
        read_ch = channel.fromPath(params.reads_path)
            .collect()
            .view()
        FlyeTest(read_ch)
    }


$ nextflow run FlyeTest.nf --reads_path 'data/Cell-?/seq-??.fastq.gz'

To make things cleaner, I suggest setting params to generally acceptable default values (adding checks where that's not possible) and writing a separate run.sh script containing nextflow run <file_name>.nf --<param_name> <param_value> ...
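
Applied to the pipeline in the question, the same idea could be sketched as below. This is only an illustration: the tuple shape carrying sample_id, the emit names paired and sam, and the per-sample file names are assumptions, not something the original poster confirmed. It also assumes a local run without containers, so the absolute adapter and index paths stay readable from inside the task. Raw reads go to Trimmomatic directly rather than through FastQC, since FastQC reports are not reads.

```nextflow
nextflow.enable.dsl = 2

params.reads    = '/home/PDX_Data/data/*_{1,2}.fastq.gz'
params.adapters = '/home/miniconda3/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa'
params.index    = '/home/ref/grch38/genome'
params.output   = '/home/PDX_Data/data/output'
params.threads  = 12

process Trimmomatic {
    // Symlink results into the output folder instead of copying large files
    publishDir "${params.output}/trimmed", mode: 'symlink'
    cpus params.threads

    input:
    tuple val(sample_id), path(reads)

    output:
    // Hypothetical emit name; both mates are declared explicitly to keep their order
    tuple val(sample_id), path("${sample_id}_paired_1.fq.gz"), path("${sample_id}_paired_2.fq.gz"), emit: paired

    script:
    """
    trimmomatic PE -threads ${task.cpus} \\
        ${reads[0]} ${reads[1]} \\
        ${sample_id}_paired_1.fq.gz ${sample_id}_unpaired_1.fq.gz \\
        ${sample_id}_paired_2.fq.gz ${sample_id}_unpaired_2.fq.gz \\
        ILLUMINACLIP:${params.adapters}:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36
    """
}

process HISAT2 {
    cpus params.threads

    input:
    tuple val(sample_id), path(r1), path(r2)

    output:
    tuple val(sample_id), path("${sample_id}.sam"), emit: sam

    script:
    """
    hisat2 -p ${task.cpus} -x ${params.index} -1 ${r1} -2 ${r2} -S ${sample_id}.sam
    """
}

workflow {
    reads_ch = Channel.fromFilePairs(params.reads)
    Trimmomatic(reads_ch)
    // The named output channel carries the trimmed reads to the next process
    HISAT2(Trimmomatic.out.paired)
}
```

Every file is written with a relative name inside the task work directory, so each output declaration can find it and Nextflow can stage it into the next task.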

6 weeks ago
neng ▴ 50

This is a common issue that many learners encounter when working with Nextflow output management.

  1. Use publishDir("${params.outdir}/...", mode: 'copy') to copy output files to a designated directory.
  2. Use stdout instead of specifying individual output files (including *); this helps avoid missing any files that are generated dynamically or unexpectedly.
    process RNAseq_quality_check {
        publishDir("${params.outdir}/01_quality_check", mode: 'copy')

        input:
        tuple val(sample_id), path(reads)

        output:
        stdout

        script:
        """
        mkdir -p ${params.outdir}/01_quality_check;
        fastqc ${reads[0]} ${reads[1]} -o ${params.outdir}/01_quality_check -t 8;
        """
    }

It looks hacky. Maybe it "works", but I guess it should not be taken as inspiration. Are you sure the output could be redirected to other processes like that? I mean, fastqc is a dead end and its output is never needed downstream, but if this were a process emitting a bam file, would that allow redirection to a downstream process?


Yeah, this looks like a hack. This page has more standard patterns: https://nextflow-io.github.io/patterns/

  • You don't need the mkdir command in the script block if you use publishDir.
  • You're forcing fastqc to write directly to params.outdir, not to the default Nextflow work directory. I like to keep work completely transient, and I delete it all the time to save disk space.
  • Declaring file outputs as files (path in Nextflow) rather than stdout would appear to be common sense.
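
On the question of a process that emits a bam file: with a path output, the file can be handed to a downstream process through the output channel. The sketch below is a hedged illustration of that pattern. params.sams, the emit name bam, and the per-sample file names are hypothetical, and the featureCounts flags are simply carried over from the question.

```nextflow
nextflow.enable.dsl = 2

params.sams   = 'sams/*.sam'   // hypothetical input location, for illustration only
params.gtf    = '/home/ref/Homo_sapiens.GRCh38.113.gtf'
params.output = '/home/PDX_Data/data/output'

process SamtoolsSort {
    input:
    tuple val(sample_id), path(sam)

    output:
    // Declared as a file, so Nextflow can stage it into downstream tasks
    tuple val(sample_id), path("${sample_id}.sorted.bam"), emit: bam

    script:
    """
    samtools sort -@ ${task.cpus} -o ${sample_id}.sorted.bam ${sam}
    """
}

process FeatureCounts {
    publishDir "${params.output}/counts", mode: 'copy'

    input:
    path bams
    path gtf

    output:
    path "featureCounts.txt"

    script:
    """
    featureCounts -T ${task.cpus} -a ${gtf} -o featureCounts.txt -p -B -C ${bams}
    """
}

workflow {
    sam_ch = Channel.fromPath(params.sams).map { f -> tuple(f.baseName, f) }
    SamtoolsSort(sam_ch)
    // Drop the sample ids and gather all BAMs so featureCounts runs once
    bams_ch = SamtoolsSort.out.bam.map { id, bam -> bam }.collect()
    FeatureCounts(bams_ch, file(params.gtf))
}
```

A stdout output, by contrast, only carries the text the command prints, so it gives a downstream process no file to stage.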