Question

CWL: Setting output directories for steps in a workflow

0

8.4 years ago

Peter Humburg ▴ 50

I'm trying to set up a workflow with CWL and am struggling to figure out how to place the output files generated by the different steps in the workflow into their own directories. As it stands, all output files are put in the same directory, creating a lot of clutter. I would very much prefer to have different sub-directories for each step.

Right now I have something like this:

working directory
 -- fastq
     -- read_1.fq
     -- read_2.fq
     ...
 -- output   # <-- all output files are dumped here

What I want is something like this:

working directory
 -- fastq
     -- sample1_read_1.fq
     -- sample1_read_2.fq
     -- sample2_read_1.fq
     -- sample2_read_2.fq
     ...
 -- trimmed
    -- sample1_read_1.trimmed.fq
    -- sample1_read_2.trimmed.fq
    -- sample2_read_1.trimmed.fq
    -- sample2_read_2.trimmed.fq
     ...
 -- bam
    -- sample1.bam
    -- sample2.bam
    ...
 -- qc
    -- QC reports
 ...

I can set up the individual command line tools to write their output to a directory, but the only way I have found to make that directory show up in the output, as indicated above, is to designate the whole directory as the output of that tool. While that would produce the directory structure I want, it introduces two problems. Firstly, it makes it harder to access individual files that are required for a subsequent step in the workflow. An obvious example is keeping the files for the first and second read properly separated. Secondly, I'm using Toil to run this workflow, and that doesn't support directories as inputs (strictly speaking this isn't a CWL issue, of course), complicating things further.

This seems like something that should be easy. Am I missing something obvious here? Any advice on how to do this would be much appreciated.

cwl • 5.7k views

ADD COMMENT • link updated 8.3 years ago by peter.amstutz ▴ 300 • written 8.4 years ago by Peter Humburg ▴ 50

0

Do you want to return the output of each step as a workflow output because 1) each intermediate result is a useful and important output on its own, or 2) for troubleshooting/debugging purposes?

If the reason is 2) then that is a bit out of scope for the CWL language itself -- all the needed information is available to the platform executing your CWL descriptions and they are well placed to provide you options to preserve intermediate outputs and present them to you in a pleasing and useful way. In this case I would invite you to talk with the Toil team about the availability of such a feature. The reference implementation of CWL has a --leave-outputs option to "Leave output files in intermediate output directories" but that produces rather ugly paths at the moment.

If the reason is 1) then I direct you to the answer below.

Cheers,

ADD REPLY • link 8.3 years ago by Michael R. Crusoe ★ 1.9k

0

Good point about debugging. That is certainly part of the reason, and I agree that extracting the intermediate files from the temporary directories is perfectly fine for that. There is the other aspect as well, but between your and Peter's answer that has been pretty well covered.

Just to illuminate the use case a bit, I'll add that with an example like the one above, I'd usually want to keep the QC report and the BAM files. In addition, there is typically output such as gene expression estimates, variant calls or similar that also needs to be kept.

ADD REPLY • link 8.3 years ago by Peter Humburg ▴ 50

score 2 · Answer 1 · 2017-03-10

2

8.3 years ago

peter.amstutz ▴ 300

cwltool has an option --leave-outputs: "Leave output files in intermediate output directories."

This may not be quite what you want, though.

You can use an ExpressionTool to take a group of files and produce a Directory object. This looks like:

cwlVersion: v1.0
class: ExpressionTool
requirements:
  InlineJavascriptRequirement: {}
inputs:
  file1: File
  file2: File
outputs:
  out: Directory
expression: |
  ${
    // Wrap the input files in a Directory object; "basename" sets the directory name.
    return {"out": {
      "class": "Directory",
      "basename": "my_directory_name",
      "listing": [inputs.file1, inputs.file2]
    } };
  }

Then you can return the Directory object in the final output instead of the individual files.
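
As a rough sketch of how this could be wired into a workflow: the step below groups the File outputs of an upstream step into a Directory, which is then the workflow output. The file names trim.cwl and files_to_directory.cwl, the step names, and the output names trimmed_1 and trimmed_2 are all hypothetical; files_to_directory.cwl is assumed to contain the ExpressionTool above.

```yaml
cwlVersion: v1.0
class: Workflow
inputs:
  read_1: File
  read_2: File
outputs:
  trimmed:
    type: Directory
    outputSource: group_trimmed/out   # only the grouped Directory is returned
steps:
  trim:
    run: trim.cwl                     # hypothetical tool with two File outputs
    in:
      read_1: read_1
      read_2: read_2
    out: [trimmed_1, trimmed_2]
  group_trimmed:
    run: files_to_directory.cwl       # hypothetical name for the ExpressionTool above
    in:
      file1: trim/trimmed_1
      file2: trim/trimmed_2
    out: [out]
```

Later steps can still consume trim/trimmed_1 and trim/trimmed_2 as individual files, so the grouping only affects how the final output is laid out.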

(longer term, it might be good to introduce a new CWL feature for grouping outputs)

ADD COMMENT • link 8.3 years ago by peter.amstutz ▴ 300

1

Thanks for your answer. The use of ExpressionTool looks like the best, or at least most flexible, way to go about this with the currently available facilities. A feature to simplify the grouping of outputs would, of course, be great. In the meantime, it might be useful to mention ExpressionTool in the User Guide. I simply didn't realise it existed, although I now realise it is in the Workflow Spec.

ADD REPLY • link 8.3 years ago by Peter Humburg ▴ 50

0

I have had the same question regarding CWL, and this is a helpful solution.

I would be curious if there could eventually be a way to support referencing the files in the Directory class with a simple notation such as my_directory.fastq_1.

There are several stages in our workflow that require referencing files from such directory outputs, as well as adding new ones to it, and it is cumbersome to use valueFrom and ExpressionTools at each intermediate step to reference and combine output and input files. I believe this feature would be very valuable.
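
For illustration, the kind of helper this currently requires might look like the sketch below: an ExpressionTool that picks one file out of a Directory by its basename. This is an assumption-laden example, not an existing shorthand. It is written against CWL v1.2 so that LoadListingRequirement can be used to make sure the listing is populated, and the input names dir and name are hypothetical.

```yaml
cwlVersion: v1.2
class: ExpressionTool
requirements:
  InlineJavascriptRequirement: {}
  LoadListingRequirement:
    loadListing: shallow_listing      # populate dir.listing one level deep
inputs:
  dir: Directory
  name: string                        # basename of the file to extract
outputs:
  out: File
expression: |
  ${
    // Keep only File entries whose basename matches the requested name.
    var matches = inputs.dir.listing.filter(function (entry) {
      return entry["class"] === "File" && entry.basename === inputs.name;
    });
    if (matches.length !== 1) {
      throw "Expected exactly one file named " + inputs.name;
    }
    return {"out": matches[0]};
  }
```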

ADD REPLY • link 7.1 years ago by ionox0 ▴ 390

score 1 · Answer 2 · 2017-03-10

1

8.3 years ago

Michael R. Crusoe ★ 1.9k

Hello Peter,

Thank you for your thoughtful question.

A quick response: a CWL tool can have both an "all in one" Directory output and specific File outputs; the overlap is fine.
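
A minimal sketch of what that could look like, assuming a tool that writes its results into a sub-directory called trimmed. The cp commands are only a stand-in for a real trimming tool, and all input and output names are hypothetical.

```yaml
cwlVersion: v1.0
class: CommandLineTool
requirements:
  ShellCommandRequirement: {}
inputs:
  read_1: File
  read_2: File
arguments:
  # Stand-in command: copies the reads into trimmed/ instead of really trimming them.
  - shellQuote: false
    valueFrom: >-
      mkdir trimmed &&
      cp $(inputs.read_1.path) trimmed/$(inputs.read_1.nameroot).trimmed.fq &&
      cp $(inputs.read_2.path) trimmed/$(inputs.read_2.nameroot).trimmed.fq
outputs:
  trimmed_dir:                        # the "all in one" Directory output
    type: Directory
    outputBinding:
      glob: trimmed
  trimmed_1:                          # the same files, also exposed individually
    type: File
    outputBinding:
      glob: trimmed/$(inputs.read_1.nameroot).trimmed.fq
  trimmed_2:
    type: File
    outputBinding:
      glob: trimmed/$(inputs.read_2.nameroot).trimmed.fq
```

A workflow can then pass trimmed_1 and trimmed_2 to the next step while listing trimmed_dir among its final outputs.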

ADD COMMENT • link 8.3 years ago by Michael R. Crusoe ★ 1.9k

1

Thanks for your answer. I have to admit that it never occurred to me to simply specify an additional directory output. That certainly gets the job done for the case I asked about (per-tool output directories). And Peter's answer below covers the more general case.

ADD REPLY • link 8.3 years ago by Peter Humburg ▴ 50