Nextflow files not referenced correctly when using wildcard in a for loop
9 months ago

Hi, I'm having some problems with my Nextflow workflow when I use wildcards (*) to pass files between processes. The files are created fine (by the augment process below), but when they are used by the snarls process, they are staged as follows:

CH-A2504_1.aug.gam -> workdir/2c/ce66a6417872a428111b7c2a5995d4/CH-A2504_01.aug.gam
CH-A2504_1.aug.pg ->  workdir/2c/ce66a6417872a428111b7c2a5995d4/CH-A2504_01.aug.pg
... 
... 
CH-A2504_23.aug.gam -> workdir/2c/ce66a6417872a428111b7c2a5995d4/CH-A2504_MT.aug.gam
CH-A2504_23.aug.pg -> workdir/2c/ce66a6417872a428111b7c2a5995d4/CH-A2504_MT.aug.pg

i.e., _1.aug instead of _01.aug, through to _23.aug instead of _MT.aug

Workflow:

include { mapping } from './../modules/mapping'
include { prepGamp } from './../modules/prepGamp'
include { augment } from './../modules/augment'
include { snarls } from './../modules/snarls'
include { pack } from './../modules/pack'
include { callVariants } from './../modules/callVariants'



workflow genomeGraph {
    take: ch_samples
    main:
        ch_mapping = mapping(ch_samples, params.XG, params.GCSA, params.DIST, params.SNARLS, params.PATHS, params.outdir)
        ch_prepGamp = prepGamp(ch_mapping, params.XG)
        ch_augment = augment(ch_prepGamp, params.CHUNK)
        ch_snarls = snarls(ch_augment)
        ch_pack = pack(ch_augment)
        ch_callVariants = callVariants(ch_augment, ch_snarls, ch_pack)
}

.

process augment {

tag { "Augment - ${filename}" }
publishDir "${params.outdir}/${group}/${filename}/Augment", mode: 'copy'
label 'process_vg'

input:
    tuple val(filename), val(group), val(sample), val(outdir), path("${filename}_mapped.sorted.gam")
    val(CHUNK)

output:
    tuple val(filename), val(group), val(sample), val(outdir), path ("${filename}_*.aug.pg"),
    path ("${filename}_*.aug.gam"), emit: ch_augment

shell:
'''
for i in $(seq -w 01 22; echo MT; echo X; echo Y); do
vg augment \
    -pv \
    "!{CHUNK}/SplicedGraph_GRCh37_chunk_${i}.pg" \
    "!{filename}_mapped.sorted.gam" \
    -s \
    -m 2 \
    -q 5 \
    -Q 5 \
    -A "!{filename}_${i}.aug.gam" > "!{filename}_${i}.aug.pg"
done
'''
}

.

process snarls {

tag { "Snarls - ${filename}" }
publishDir "${params.outdir}/${group}/${filename}/Snarls", mode: 'copy'
label 'process_vg'

input:
    tuple val(filename), val(group), val(sample), val(outdir), path ("${filename}_*.aug.pg"), path ("${filename}_*.aug.gam")

output:
    tuple val(filename), val(group), val(sample), val(outdir), path ("${filename}_*.snarls"), emit: ch_snarls

shell:
'''
for i in $(seq -w 01 22; echo X; echo Y; echo MT); do
    echo "Computing Chr ${i} snarls. Please wait ...";
    vg snarls \
       !{filename}_${i}.aug.pg > !{filename}_${i}.snarls
done
'''
}

Any help would be much appreciated. I'm sure it is something super simple that I'm missing.

9 months ago

Don't use this kind of loop in the shell script. Instead, tell Nextflow about the chromosomes and combine them with the output of the other process:

    chr_ch=Channel.of("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y")

     ch_augment = augment(ch_prepGamp.combine(chr_ch),params.CHUNK)
    (...)
    process augment {

    input:
        tuple val(filename),(...),val(contig)
        val(CHUNK)
    (...)

    shell:
    '''
    vg augment \
        -pv \
        "!{CHUNK}/SplicedGraph_GRCh37_chunk_${contig}.pg" \
    (...)
    '''
    }

Thank you very much, that appears to be working. I just had to change ${contig} to !{contig}, since it is a Nextflow variable rather than a bash variable (it took a few errors for me to pick that up, haha). Thank you again for all your help.

9 months ago

Did you check the .command.sh files in the task dirs of your snarls tasks to check if the command being run is what you expect?
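As a rough sketch of that check: the helper below prints the rendered script of one task and lists each staged input link with its target, which makes any renaming visible at a glance. The function name `inspect_task` and the example path are made up for illustration; it assumes inputs are staged as symlinks, which is Nextflow's default.

```shell
#!/usr/bin/env bash
# Sketch: show the rendered script and the staged input links of one task.
# inspect_task is a hypothetical helper, not part of Nextflow.
inspect_task() {
    local task_dir="$1"

    echo "== .command.sh =="
    cat "${task_dir}/.command.sh"

    echo "== staged inputs (link -> target) =="
    local f
    for f in "${task_dir}"/*; do
        # Only symlinks are of interest: inputs are assumed to be staged as links.
        if [ -L "$f" ]; then
            printf '%s -> %s\n' "$(basename "$f")" "$(readlink "$f")"
        fi
    done
}

# Example call (placeholder path; use the task dir reported for your snarls task):
# inspect_task work/ab/0123456789abcdef
```

If the link names on the left differ from the target names on the right, the renaming happened during staging and not in the shell loop.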

The first thing to check is whether the wrong names come from your shell script or from Nextflow. You also have MT, X, and Y in the first process and X, Y, and MT in the second one. In a script block, ${i} refers to a Nextflow variable, while you actually want a shell variable. You don't want it to be resolved by Nextflow beforehand, so what people usually do is write \$i with script instead of shell, but it's up to you.

You didn't provide a minimal reproducible example, but I have a working example below. If even after the tips I gave above you don't manage to make it work, try to evolve the example below to what you have and see what's breaking.

process FOO {
  input:
    val x

  output:
    path "*.txt"

  script:
    """
    for i in \$(seq -w 01 22; echo X; echo Y; echo MT); do
      echo \$i > x_${x}_\$i.txt
    done
    """
}

process BAR {
  debug true

  input:
    path text_file

  output:
    stdout

  script:
    """
      echo filenames: $text_file
    """
}

workflow {
  Channel
    .of(1..3)
    | FOO
    | BAR
}
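One difference worth noting between this example and the original snarls process: BAR declares its input as a plain `path text_file`, while snarls puts a `*` in the input name. As far as I understand the Nextflow documentation, a `*` in an input file name is replaced by a running index (1, 2, 3, ...) when several files are staged, which would match the _1 ... _23 names in the question. The bash below is only an imitation of that rule, not Nextflow code; `simulate_staging` is a made-up name, and the lexicographic arrival order is an assumption.

```shell
#!/usr/bin/env bash
# Imitation only: mimics the renaming rule as described above, outside Nextflow.
# Replaces the first '*' in a staging pattern with a 1-based index, one per file.
simulate_staging() {
    local pattern="$1"; shift
    local n=0 src
    for src in "$@"; do
        n=$((n + 1))
        printf '%s -> %s\n' "${pattern/\*/$n}" "$src"
    done
}

# The 25 names written by the augment loop, in lexicographic order
# (assumed to be the order in which the files reach the next process).
names=()
for i in $(seq -w 1 22) MT X Y; do
    names+=("CH-A2504_${i}.aug.pg")
done

simulate_staging 'CH-A2504_*.aug.pg' "${names[@]}"
```

The first line printed is `CH-A2504_1.aug.pg -> CH-A2504_01.aug.pg` and the 23rd is `CH-A2504_23.aug.pg -> CH-A2504_MT.aug.pg`, the same pattern as in the question. If that is indeed the cause, declaring the input without a `*` (as BAR does), or looping over the files that were actually staged instead of rebuilding their names, should avoid the mismatch.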

Thank you very much! I did check the snarls .command.sh and it appears correct. I'll make the suggested changes and see how I go.


Just note that you're using shell and I'm using script. This changes how you refer to shell variables and Nextflow variables!
