Question

Merge fastq files across different lanes in Nextflow

0

6 months ago

jkim ▴ 170

Hello,

I'm trying to concatenate FASTQ files across lanes using Nextflow, but it doesn't work the way I expected. I found this post (Merge fastq files) and essentially copied its code.

nextflow.enable.dsl=2

def getLibraryId( prefix ){
  // fastqfile = ABC-S16_L001_R1_001.fastq.gz, ABC-S16_L002_R1_001.fastq.gz
  prefix.split("_")[0] 
}

//params.raw_data_dir = "rawdata/"

// Gather the pairs of R1/R2 according to sample ID
Channel
     .fromFilePairs(params.rawdata + '/*_R{1,2}*.fastq.gz', flat: true, checkExists: true)
     .map { prefix, R1, R2 -> tuple(getLibraryId(prefix), R1, R2) }
     .groupTuple().set{ files_channel }


process merge_lane {
    debug true
    tag "merging ${sample}"
    cpus 2
    memory '2 GB'
    time '2h'

    publishDir "${launchDir}/analysis/merge_lane", mode : "copy"

    input:
        tuple val(sample), path(R1), path(R2)
    output:
        path("${sample}_R1.fastq.gz")
        path("${sample}_R2.fastq.gz")
    script:
        """
        cat ${ R1.collect{ it }.sort().join(" ") } > ${sample}_R1.fastq.gz
        cat ${ R2.collect{ it }.sort().join(" ") } > ${sample}_R2.fastq.gz
        """
}

Nextflow generated a .command.sh for each sample, and I noticed that some of them didn't look right. This is what I wanted each one to do:

cat Sample_L001_R1_001.fastq.gz Sample_L002_R1_001.fastq.gz > Sample_R1.fastq.gz

For example, this one is correct:

#!/bin/bash -ue
cat 6305-No_E_S23_L001_R1_001.fastq.gz 6305-No_E_S23_L002_R1_001.fastq.gz > 6305_R1.fastq.gz
cat 6305-No_E_S23_L001_R2_001.fastq.gz 6305-No_E_S23_L002_R2_001.fastq.gz > 6305_R2.fastq.gz

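But for some reason, as you can see in the script below, Nextflow/Groovy didn't always sort the FASTQ files by name. Here the R1 files are listed with L002 before L001, while the R2 files are in the expected order, so the merged mates would no longer be in the same read order.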

#!/bin/bash -ue
cat 6298-No_E_S16_L002_R1_001.fastq.gz 6298-No_E_S16_L001_R1_001.fastq.gz > 6298_R1.fastq.gz
cat 6298-No_E_S16_L001_R2_001.fastq.gz 6298-No_E_S16_L002_R2_001.fastq.gz > 6298_R2.fastq.gz

Could you advise me on how to prevent this in Nextflow?

Nextflow Fastq • 730 views

ADD COMMENT • link 6 months ago by jkim ▴ 170

2

6 months ago

Pierre Lindenbaum 161k

Not tested, but merging can be parallelized for R1 and R2, with the results joined on libraryId afterwards.

params.rawdata="NO_FILE";

def getLibraryId( prefix ){
  prefix.split("_")[0] 
}

workflow {
    files_channel = Channel.fromFilePairs("${params.rawdata}/*_R{1,2}*.fastq.gz", flat: true, checkExists: true).
        map{prefix, R1, R2 -> tuple(getLibraryId(prefix), R1, R2)}.
        flatMap{libraryId,R1,R2 -> [
            [[libraryId,"R1"],R1],
            [[libraryId,"R2"],R2]
            ] }.
        groupTuple()

    merge_ch = MERGE(files_channel)

    concat_R1_ch = merge_ch.output.filter{K,FASTQ->K[1].equals("R1")}.map{K,FASTQ->[K[0],FASTQ]}
    concat_R2_ch = merge_ch.output.filter{K,FASTQ->K[1].equals("R2")}.map{K,FASTQ->[K[0],FASTQ]}

    pair_ch = concat_R1_ch.join(concat_R2_ch)

    pair_ch.view()
    }
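process MERGE {
    tag "${key} ${fastqs}"

    input:
        // key = [libraryId, "R1" or "R2"]; fastqs = all lanes for that key.
        // Declared as val, so the files are not staged and are read from their original paths.
        tuple val(key), val(fastqs)
    output:
        tuple val(key), path("${key[0]}.${key[1]}.fastq.gz"), emit: output
    script:
        // Sort explicitly by file name so that L001 always precedes L002.
        def ordered = fastqs.sort{ A, B -> A.name.compareTo(B.name) }.join(" ")
        """
        cat ${ordered} > ${key[0]}.${key[1]}.fastq.gz
        """
}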
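
If you would rather not sort in Groovy at all, the ordering can also be done by the shell itself. The following is only a minimal sketch, not part of the tested workflow above: merge_sorted is a hypothetical helper name, and it assumes file names contain no newlines. Inside a Nextflow script block the bash variables would need to be escaped (\$) or the helper kept in a separate script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# merge_sorted OUT FILE...
# Concatenate FILEs into OUT in byte-wise name order, regardless of the
# order in which they were passed on the command line.
# (Hypothetical helper; assumes no newlines in file names.)
merge_sorted() {
    local out=$1
    shift
    # LC_ALL=C makes the sort order independent of the user's locale.
    printf '%s\n' "$@" | LC_ALL=C sort | while IFS= read -r f; do
        cat "$f"
    done > "$out"
}
```

Called as merge_sorted Sample_R1.fastq.gz Sample_L002_R1_001.fastq.gz Sample_L001_R1_001.fastq.gz, it should still write the L001 file first, because the names are sorted before concatenation.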

ADD COMMENT • link 6 months ago by Pierre Lindenbaum 161k

0

Thanks for sharing the code! I will be trying that.

ADD REPLY • link 6 months ago by jkim ▴ 170

score 1 · Accepted Answer · 2023-10-19

I posted this question to the Nextflow Slack community and got some feedback. It looks like I would have to work with Groovy directly, so I decided to take a different approach. There is a short, concise script at https://github.com/stephenturner/mergelanes#an-easier-way, which I modified slightly.

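#!/usr/bin/env bash
# Merge per-lane FASTQ files into one R1 and one R2 file per sample.

if [ $# -ne 3 ]; then
  echo "Usage: bash this.sh [delimiter] [input_fq_dir] [merged_fq_dir]"
  exit 1
fi

set -euo pipefail

delimiter=$1
input_fq_dir=$2
merged_fq_dir=$3

mkdir -p "$merged_fq_dir"

# Sample ID = file name up to the first delimiter. Taking the basename first
# keeps delimiters or slashes in the directory path from corrupting the ID.
for fq in "$input_fq_dir"/*R1*; do basename "$fq"; done \
    | cut -d "$delimiter" -f 1 | sort -u \
    | while read -r id; do
        for mate in R1 R2; do
            # The shell expands globs in sorted order, so L001 precedes L002.
            # Anchoring on the delimiter stops ID "S1" from also matching "S10".
            echo "$input_fq_dir/$id$delimiter"*"$mate"*.fastq.gz "-->" "$merged_fq_dir/${id}_$mate.fastq.gz"
            cat "$input_fq_dir/$id$delimiter"*"$mate"*.fastq.gz > "$merged_fq_dir/${id}_$mate.fastq.gz"
        done
      done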
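
After merging, it may be worth checking that each R1/R2 pair still contains the same number of reads. Concatenated gzip files are themselves valid gzip, so they can be decompressed and counted directly. This is a rough sketch only: count_reads and check_pair are hypothetical helper names, and it assumes the usual four lines per FASTQ record. It does not verify that the reads are in the same order in both files.

```shell
#!/usr/bin/env bash
set -euo pipefail

# count_reads FILE
# Print the number of FASTQ records in a gzip-compressed file,
# assuming four lines per record. (Hypothetical helper.)
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# check_pair R1 R2
# Succeed only if both mate files contain the same number of reads.
check_pair() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -ne "$n2" ]; then
        echo "read count mismatch: $1=$n1 $2=$n2" >&2
        return 1
    fi
    echo "$1 and $2: $n1 reads each"
}
```

For example, check_pair merged/6298-No_R1.fastq.gz merged/6298-No_R2.fastq.gz would report the shared read count, or fail with a message if the two files differ (the file names here are only illustrative).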