Merge fastq files across different lanes in Nextflow
2
0
Entering edit mode
6 months ago
jkim ▴ 170

Hello,

I'm trying to concatenate fastq files using Nextflow but I noticed that it doesn't seem to work the way I wanted it to be. I saw this post (Merge fastq files ) and basically copied it off.

nextflow.enable.dsl=2

def getLibraryId( prefix ){
  // fastqfile = ABC-S16_L001_R1_001.fastq.gz, ABC-S16_L002_R1_001.fastq.gz
  prefix.split("_")[0] 
}

//params.raw_data_dir = "rawdata/"

// Gather the pairs of R1/R2 according to sample ID
Channel
     .fromFilePairs(params.rawdata + '/*_R{1,2}*.fastq.gz', flat: true, checkExists: true)
     .map { prefix, R1, R2 -> tuple(getLibraryId(prefix), R1, R2) }
     .groupTuple().set{ files_channel }


process merge_lane {
    debug true
    tag "merging ${sample}"
    cpus 2
    memory '2 GB'
    time '2h'

    publishDir "${launchDir}/analysis/merge_lane", mode : "copy"

    input:
        tuple val(sample), path(R1), path(R2)
    output:
        path("${sample}_R1.fastq.gz")
        path("${sample}_R2.fastq.gz")
    script:
        """
        cat ${ R1.collect{ it }.sort().join(" ") } > ${sample}_R1.fastq.gz
        cat ${ R2.collect{ it }.sort().join(" ") } > ${sample}_R2.fastq.gz
        """
}

Nextflow generated .command.sh for each sample and I noticed that some of them didn't look right. For example:

This is what I wanted to do. cat Sample_L001_R1_001.fastq.gz Sample_L002_R1_001.fastq.gz > Sample_R1.fastq.gz

#!/bin/bash -ue
cat 6305-No_E_S23_L001_R1_001.fastq.gz 6305-No_E_S23_L002_R1_001.fastq.gz > 6305_R1.fastq.gz
cat 6305-No_E_S23_L001_R2_001.fastq.gz 6305-No_E_S23_L002_R2_001.fastq.gz > 6305_R2.fastq.gz

But for some reason, as you can see the script below, nextflow/groovy didn't seem to sort fastq files by name.

#!/bin/bash -ue
cat 6298-No_E_S16_L002_R1_001.fastq.gz 6298-No_E_S16_L001_R1_001.fastq.gz > 6298_R1.fastq.gz
cat 6298-No_E_S16_L001_R2_001.fastq.gz 6298-No_E_S16_L002_R2_001.fastq.gz > 6298_R2.fastq.gz

Could you advise me on how to prevent this in Nextflow?

Nextflow Fastq • 730 views
ADD COMMENT
1
Entering edit mode
6 months ago
jkim ▴ 170

I have put this post on nextflow slack community and got some feedback. It looks like I would have to deal with groovy so I decided to use different stuff. There's some short and concise code. https://github.com/stephenturner/mergelanes#an-easier-way and I updated it a little bit.

#!/usr/bin/bash

if [ $# -ne 3 ]; then
  echo "bash this.sh [delimiter] [input_fq_dir] [merged_fq_dir]"
  exit 1
fi

set -eou pipefail

delimiter=$1
input_fq_dir=$2
merged_fq_dir=$3

mkdir -p $merged_fq_dir

ls $input_fq_dir/*R1* | cut -d $delimiter -f 1 | sort | uniq | sed "s/$input_fq_dir\///" \
    | while read id; do \
        echo $input_fq_dir/$id*R1*.fastq.gz --\> $merged_fq_dir/$id\_R1.fastq.gz;
        cat $input_fq_dir/$id*R1*.fastq.gz > $merged_fq_dir/$id\_R1.fastq.gz;
        echo $input_fq_dir/$id*R2*.fastq.gz --\>  $merged_fq_dir/$id\_R2.fastq.gz;
        cat $input_fq_dir/$id*R2*.fastq.gz > $merged_fq_dir/$id\_R2.fastq.gz;
      done
ADD COMMENT
2
Entering edit mode
6 months ago

no tested, but merging can be parallelized for R1 and R2. Join on libraryId, later

params.rawdata="NO_FILE";

def getLibraryId( prefix ){
  prefix.split("_")[0] 
}

workflow {
    files_channel = Channel.fromFilePairs("${params.rawdata}/*_R{1,2}*.fastq.gz", flat: true, checkExists: true).
        map{prefix, R1, R2 -> tuple(getLibraryId(prefix), R1, R2)}.
        flatMap{libraryId,R1,R2 -> [
            [[libraryId,"R1"],R1],
            [[libraryId,"R2"],R2]
            ] }.
        groupTuple()

    merge_ch = MERGE(files_channel)

    concat_R1_ch = merge_ch.output.filter{K,FASTQ->K[1].equals("R1")}.map{K,FASTQ->[K[0],FASTQ]}
    concat_R2_ch = merge_ch.output.filter{K,FASTQ->K[1].equals("R2")}.map{K,FASTQ->[K[0],FASTQ]}

    pair_ch = concat_R1_ch.join(concat_R2_ch)

    pair_ch.view()
    }
process MERGE {
tag "${key} ${fastqs}"
input:
    tuple val(key),val(fastqs)
output:
    tuple val(key),path("${key[0]}.${key[1]}.fastq.gz"),emit:output
script:
    def ordered =fastqs.sort{A,B->A.name.compareTo(B.name)}.join(" ")
"""
cat ${ordered} > ${key[0]}.${key[1]}.fastq.gz
"""
}
ADD COMMENT
0
Entering edit mode

Thanks for sharing the code! I will be trying that.

ADD REPLY

Login before adding your answer.

Traffic: 1383 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6