Question

Automating pipeline with Parallel, read files in separate folders

1


22 months ago

SaltedPork ▴ 170

I have a pipeline script called pipeline.sh. I usually run it for a single sample like so:

$ pipeline.sh sample1 sample1.R1.fastq.gz sample1.R2.fastq.gz

Here $1 is the sample ID, and $2 and $3 are the read files.

I use GNU parallel with a parameters file that specifies the paths to each file.

$ nohup parallel -j 4 -a params.pipeline.txt --colsep '\s+' ./pipeline.sh

I want to automate my pipeline so that I don't need a parameters file and it looks through the folders for the reads (same file structure as they come out of the sequencer).

I have:

parallel -j 4 ./pipeline.sh {/1.} ::: *.R1.fastq.gz :::+ *.R2.fastq.gz

However, this assumes the FASTQ files are all in the same folder. How can I change my parallel command so that it searches through the folder structure for the files?

fastq automation bash parallel • 1.0k views

updated 22 months ago by ole.tange ★ 4.4k • written 22 months ago by SaltedPork ▴ 170

0


This sounds like something Nextflow could do fairly easily. The learning curve is fairly steep, but it is worthwhile, and the documentation is very thorough. I've implemented read mapping, SNP calling, SV calling, and methylation calling pipelines in it; for most of them I only need to supply the input path, run ID, and reference genome path, and Nextflow handles the rest.

22 months ago by dthorbur ★ 1.9k

0


What's your folder structure?

22 months ago by cpad0112 21k

2


22 months ago

ole.tange ★ 4.4k

Let us assume the files are called:

a/b/c/d/sample1.R1.fastq.gz
a/b/c/d/sample1.R2.fastq.gz
a/b/e/f/sample2.R1.fastq.gz
a/b/e/f/sample2.R2.fastq.gz

Then you may try:

parallel --plus -j 4 ./pipeline.sh {/...} {} {/R1/R2} ::: */*/*/*/*.R1.fastq.gz
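
The three arguments that the command above builds (sample ID, R1 path, R2 path) can also be derived with plain shell parameter expansion, which makes it easy to inspect them before running anything. The following is only a sketch: list_read_pairs is a hypothetical helper name, and it assumes the .R1.fastq.gz / .R2.fastq.gz naming from the question and paths without whitespace.

```shell
# Hypothetical helper (not part of GNU parallel): print "sampleID R1 R2"
# for every *.R1.fastq.gz found anywhere below the directory given as $1.
# Assumes paths contain no whitespace, so the output suits --colsep '\s+'.
list_read_pairs() {
    find "$1" -type f -name '*.R1.fastq.gz' | sort | while IFS= read -r r1; do
        r2="${r1%.R1.fastq.gz}.R2.fastq.gz"   # same path with R1 swapped for R2
        sample="${r1##*/}"                    # drop the directory part
        sample="${sample%.R1.fastq.gz}"       # drop the suffix, leaving the sample ID
        if [ -f "$r2" ]; then
            printf '%s %s %s\n' "$sample" "$r1" "$r2"
        else
            # Report unpaired R1 files on stderr instead of emitting a broken line
            printf 'warning: no R2 file for %s\n' "$r1" >&2
        fi
    done
}
```

Something like `list_read_pairs a > params.pipeline.txt` would then produce a parameters file for the original `-a params.pipeline.txt --colsep '\s+'` command. Adding --dry-run to a parallel command prints the jobs it would start without running them, which is a quick way to check the arguments.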


score 4 · Accepted Answer · 2022-06-06

Using Nextflow (not tested, but it should look something like this):

nextflow.enable.dsl = 1
params.directories=""

process scanDirectories {
output:
    path("paths.txt") into paths
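script:
// Writes one CSV line per sample: sample ID, R1 path, R2 path.
// The pattern assumes *.R1.fq.gz; change it if your files end in .R1.fastq.gz.
"""
find ${params.directories} -type f -name "*.R1.fq.gz" | \
    awk -F '/' '{S=\$NF;gsub("\\.R1\\.fq\\.gz\$","",S);F2=\$0;gsub("\\.R1\\.fq\\.gz\$",".R2.fq.gz",F2);printf("%s,%s,%s\\n",S,\$0,F2);}' > paths.txt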

"""
}


paths.splitCsv(header: false,sep:',',strip:true).set{pipe_in}

process runPipeline {
tag "${sample}"
input:
    tuple val(sample),val(R1),val(R2) from pipe_in
output:
    path("result.txt") into result_ch
script:
"""
echo "DO Something ${sample} ${R1} ${R2}" > result.txt
"""
}

and then run something like:

nextflow run -resume script.nf --directories "/path/to/dir1 /path/to/dir2"
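
Since the script above is untested, the awk step of scanDirectories can be checked on its own in a shell before running it through Nextflow. This is a minimal sketch under the same assumption as the answer (files ending in .R1.fq.gz); r1_to_csv is a hypothetical name, and the Nextflow-specific backslash escaping is dropped because this runs directly in the shell.

```shell
# Hypothetical helper: turn R1 paths on stdin into "sample,R1,R2" CSV lines,
# mirroring the awk step of scanDirectories (suffix .R1.fq.gz assumed).
r1_to_csv() {
    awk -F '/' '{
        sample = $NF                             # last path component
        sub(/\.R1\.fq\.gz$/, "", sample)         # strip the suffix, leaving the sample ID
        r2 = $0
        sub(/\.R1\.fq\.gz$/, ".R2.fq.gz", r2)    # R1 path rewritten to the R2 path
        printf("%s,%s,%s\n", sample, $0, r2)
    }'
}
```

For example, `find /path/to/dir1 -type f -name '*.R1.fq.gz' | r1_to_csv` should print the same lines that the process writes to paths.txt.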