Question

Nextflow: How to format input tuple for STAR_ALIGN process with STAR index directory

1 day ago

DdogBoss ▴ 20

Hi all,

I am trying to pass the output of a STAR genome index process to a STAR alignment process in Nextflow, but I keep running into tuple/variable issues. Here’s a minimal reproducible example of my setup.

STAR index process

    process STAR {
    publishDir params.outdir, mode: "copy"
    label 'process_high'

    input:
        tuple path(reference), path(gtf)

    output:
         path 'STAR_index/' emit: index_dir

    script:
    """
    mkdir -p STAR_index
    STAR --runThreadN $task.cpus \
         --runMode genomeGenerate \
         --genomeDir STAR_index \
         --genomeFastaFiles $reference \
         --sjdbGTFfile $gtf
    """
    }

Workflow snippet

    // Channel with sample reads
    Channel.fromFilePairs(params.reads)
    | set { align_ch }
    // STAR index
    star_index_result = STAR(star_index_ch)
    // Flatten reads and prepare tuples
    pre_aligned_input_ch = align_ch.map { sample, reads ->
    tuple(sample, reads.toArray())   
    }
    aligned_input = pre_aligned_input_ch.combine(star_index_result.index_dir) { sample_tuple, index_dir ->
    def sample = sample_tuple[0]
    def reads  = sample_tuple[1]
    tuple(index_dir, sample, *reads)  
    }
    STAR_ALIGN(aligned_input)

align_ch returns tuples like [sample_name, [read1, read2]]. I want aligned_input tuples to look like [index_dir, sample1, read1, read2].

Reads are structured like *_{R1,R2}.subset.fastq.gz where the wildcard is the sample name.

STAR align process

    process STAR_ALIGN {
    publishDir params.outdir, mode: "copy"
    label 'process_high'

    input:
    tuple path(index_dir), val(sample), path(reads)

    output:
    tuple val(sample), path("*.bam"), emit: bam
    tuple val(sample), path("*.Log.final.out"), emit: log

    script:
    """
    STAR \
        --runThreadN $task.cpus \
        --genomeDir ${index_dir} \
        --readFilesIn ${reads.join(' ')} \
        --readFilesCommand zcat \
        --outFileNamePrefix ${sample}_ \
        --outSAMtype BAM SortedByCoordinate \
        2> ${sample}.Log.final.out
    """
    }

Problem: The tuples I am feeding into the STAR align process are not valid. I get either this error:

    ERROR ~ No such variable: path -- Check script 'modules/star/main.nf' at line: 12 or see '.nextflow.log' file for more details

or a DataflowVariable error, or a "not a valid path" error.

My goal is to massage the input to STAR_ALIGN so that it receives a tuple like this:

    [index_dir, sample_name, [read1, read2]]

Current attempts with .combine() and *reads either throw a DataflowVariable error or the path variable is missing.

Question: How can I properly construct a Nextflow channel/tuple so that each sample is paired with the STAR index directory and the list of reads, in a format acceptable for the STAR_ALIGN process input:

    tuple path(index_dir), val(sample), path(reads)

20 hours ago

Pierre Lindenbaum 166k

align_ch returns tuples like [sample_name, [read1, read2]]. I want aligned_input tuples to look like [index_dir, sample1, read1, read2].

    STAR_ALIGN(
        star_index_result.combine(
            align_ch.map { sn, reads -> [sn, reads[0], reads[1]] }
        )
    )

and in STAR_ALIGN:

    input:
    tuple path(index_dir), val(sample), path(R1), path(R2)
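
With the reads split into two inputs, the script block of the process has to change as well, because there is no longer a `reads` list to join. A minimal sketch of the adjusted process, assuming everything else stays as posted in the question (directives, outputs and STAR options are copied from there, not verified):

```groovy
process STAR_ALIGN {
    publishDir params.outdir, mode: "copy"
    label 'process_high'

    input:
    // One flat tuple per sample: [index_dir, sample, R1, R2]
    tuple path(index_dir), val(sample), path(R1), path(R2)

    output:
    tuple val(sample), path("*.bam"), emit: bam
    tuple val(sample), path("*.Log.final.out"), emit: log

    script:
    """
    STAR \
        --runThreadN ${task.cpus} \
        --genomeDir ${index_dir} \
        --readFilesIn ${R1} ${R2} \
        --readFilesCommand zcat \
        --outFileNamePrefix ${sample}_ \
        --outSAMtype BAM SortedByCoordinate \
        2> ${sample}.Log.final.out
    """
}
```

The only changes from the question's version are the input declaration and `--readFilesIn ${R1} ${R2}` in place of `${reads.join(' ')}`.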

score 1 · Accepted Answer · 2025-10-06

I needed to be aware of how queue channels behave, which you can read about here:

https://training.nextflow.io/2.0/basic_training/channels/#queue-channel

The input to the STAR index process, which I did not show in my original post, was built like this:

    Channel.of([params.genome, params.gtf])
        | set { star_index_ch }

I then passed star_index_ch to the STAR index process. Because star_index_ch is a queue channel, the index that STAR produces also arrives on a queue channel: it is emitted once and gets "used up" by the first STAR align task, so it is only ever paired with one sample.

Instead, I needed to pass the genome and GTF directly as a plain tuple rather than through a channel, like this:

    STAR(tuple(file(params.genome), file(params.gtf)))

This way, the index produced by the STAR index process can be reused by every STAR align task. When a process is called with plain values instead of a queue channel, Nextflow treats those inputs as value channels and the process output is a value channel as well, so the index is attached to each sample name and its reads, and the expected number of alignment tasks runs.
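
If the index process still has to be fed from a queue channel, a related pattern is to convert its output with the `first()` operator, which returns a value channel, and give the index its own input in the alignment process. The sketch below is an illustration, not the code I ended up running: `STAR_ALIGN_SPLIT` and `index_ch` are hypothetical names, and it assumes the STAR process from the question is available with its output declared as `path 'STAR_index', emit: index_dir` (note the comma before `emit:`).

```groovy
// Hypothetical variant of STAR_ALIGN that takes the index as a separate input
process STAR_ALIGN_SPLIT {
    input:
    path index_dir                    // value channel: reused for every sample
    tuple val(sample), path(reads)    // queue channel: one task per sample

    output:
    tuple val(sample), path("*.bam"), emit: bam

    script:
    """
    STAR \
        --runThreadN ${task.cpus} \
        --genomeDir ${index_dir} \
        --readFilesIn ${reads.join(' ')} \
        --readFilesCommand zcat \
        --outFileNamePrefix ${sample}_ \
        --outSAMtype BAM SortedByCoordinate
    """
}

workflow {
    // Emits [sample, [R1, R2]] for each pair matched by params.reads
    align_ch = Channel.fromFilePairs(params.reads)

    // Queue channel in, so the index comes out on a queue channel (one item)
    STAR(Channel.of([file(params.genome), file(params.gtf)]))

    // first() turns that single-item queue channel into a value channel
    index_ch = STAR.out.index_dir.first()

    // The value channel is read once per item in align_ch
    STAR_ALIGN_SPLIT(index_ch, align_ch)
}
```

Keeping the index as a separate input avoids reshaping the read tuples at all: `align_ch` is passed through exactly as `fromFilePairs` emits it.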