Question: How to process inputs based on a filename pattern using CWL
3.6 years ago, thomas.e wrote:

This must be answered somewhere but I can't find it. How do I process all files matching a pattern, e.g.


I've seen examples where multiple inputs are explicitly listed, but none where a pattern is specified.

3.4 years ago, thomas.e wrote:

For others looking at this thread, this is how I solved the problem. There may be better ways.

Basically, I ran my single-file workflow as a sub-workflow in a step that scatters across the input files. I did not attempt to get CWL to scan directories for input files, so I will build the input file in a separate step.

So, the input looks like:

reads1:
  - class: File
    format: "edam:format_1930"
    location: "../data/S1_R1.fastq.gz"
  - class: File
    format: "edam:format_1930"
    location: "../data/S2_R1.fastq.gz"
reads2:
  - class: File
    format: "edam:format_1930"
    path: "../data/S1_R2.fastq.gz"
  - class: File
    format: "edam:format_1930"
    path: "../data/S2_R2.fastq.gz"

$namespaces: { edam: }
$schemas: [ ]

I then have a scatter workflow:

class: Workflow
cwlVersion: v1.0

requirements:
- class: InlineJavascriptRequirement
- class: ScatterFeatureRequirement
- class: StepInputExpressionRequirement
- class: SubworkflowFeatureRequirement

inputs:
  reads1: File[]
  reads2: File[]
  # I found that with toil some constant input files also need to be reproduced here
  adapters:
    type: File
    default:
      class: File
      path: /path/to/adapters/TruSeq3-PE.fa
      location: /path/to/adapters/TruSeq3-PE.fa

outputs:
  [all the workflow outputs here]

steps:
  # The step name and the "adapters" input key below are placeholders;
  # use whatever names single-file-pl.cwl actually declares.
  process_pair:
    run: single-file-pl.cwl

    scatter: [read1, read2]
    scatterMethod: dotproduct

    in:
      read1:
        source: reads1
      read2:
        source: reads2
      adapters:
        source: adapters

    out: [all workflow output here]

Hopefully, this helps someone.

3.6 years ago, Michael R. Crusoe (Common Workflow Language project) wrote:

Hello thomas.e,

For CWL implementations that consume YAML/JSON input objects you'll need a separate File entry for each file.

Here's an example input object, assuming an input named raw_data and of the type File[] (also known as type: array, items: File )

raw_data:
  - class: File
    path: /path/to/rawdata/000.dat
  - class: File
    path: /path/to/rawdata/001.dat

I've made an issue to add a convenience feature to the reference implementation to make this easier:

3.6 years ago, StarvingMarvin wrote:

I see two scenarios here:

  1. This is how you would invoke your tool from the shell, and the shell would expand the glob to a list of files
  2. The tool itself accepts a glob as an input

In case 1, as I mentioned, the tool never receives *.dat as an argument. It actually gets the resolved glob, a list of file paths. The way to handle this in CWL is with a mini workflow: first you have a tool that receives a list of files as an input and creates semantically meaningful outputs, say dat_files, meta_files, other_files. You do that by specifying globs on the outputs of that tool. Next you connect dat_files to your tool, which then receives all the file paths on its command line, precisely as if it had been invoked with a glob.
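As a rough illustration of the first tool in that mini workflow, here is a minimal sketch of a "sorter" that stages its inputs into the working directory and lets output globs do the classification. The input name all_files and the *.meta extension are assumptions made up for this example, and the sketch has not been checked against any particular runner.

```yaml
cwlVersion: v1.0
class: CommandLineTool

# Stage every input file into the working directory so the output
# globs below can match them by name.
requirements:
  InitialWorkDirRequirement:
    listing: $(inputs.all_files)

# The command itself does nothing; "true" is quoted so YAML does not
# read it as a boolean.
baseCommand: "true"

inputs:
  all_files: File[]      # hypothetical input name

outputs:
  dat_files:
    type: File[]
    outputBinding:
      glob: "*.dat"
  meta_files:
    type: File[]
    outputBinding:
      glob: "*.meta"     # hypothetical extension
```

In the enclosing workflow, the dat_files output of this step would then be the source for the File[] input of the real tool.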

In the other case, the best course of action would be to stage the input files to the working directory and pass a glob as a string to the tool.
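A minimal sketch of that second scenario might look like the following. The tool name mytool and the input names data_files and pattern are hypothetical. Because no ShellCommandRequirement is declared, the command is not run through a shell, so the pattern should reach the tool as a literal string.

```yaml
cwlVersion: v1.0
class: CommandLineTool

# Stage the files into the working directory, where the tool will
# expand the pattern itself.
requirements:
  InitialWorkDirRequirement:
    listing: $(inputs.data_files)

baseCommand: mytool      # hypothetical tool that accepts a glob

inputs:
  # No inputBinding here: the files are staged but not put on the
  # command line.
  data_files: File[]
  pattern:
    type: string
    default: "*.dat"
    inputBinding:
      position: 1

outputs: []
```

The files still have to be listed in the input object; only the command line changes, from a list of paths to a single pattern string.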


Thanks, good suggestions. I'll look at option 2 - when (if) I get sufficiently proficient at CWL that even the smallest things don't take hours :)

Reply written 3.6 years ago by thomas.e