How to implement this two-stage one-to-many workflow using WDL?
13 months ago
kynnjo ▴ 70

I am having a very difficult time translating my workflows to WDL. The two-stage workflow described below is a case in point.

Suppose that I have two programs, which I'll call FIRST and SECOND.

FIRST generates N files. (The input arguments for FIRST are not important.)

SECOND processes files like those generated by FIRST. (You may assume that SECOND takes, among its arguments, the path to a file like those generated by FIRST.)

I want to implement a workflow where FIRST generates a certain number (N) of files, and, subsequently, N independent runs of SECOND process these N files in parallel.

Can someone show me the WDL to implement a workflow with this general structure?

I should add that, if I were to implement this workflow using, say, a bash script + LSF, I would have the script first run FIRST, putting all the files it generates in one directory D (with nothing else in it), and then I would iterate over the files in this directory D, spawning (via LSF) a parallel run of SECOND for each file encountered.

Unfortunately, as far as I can tell, WDL provides no support for iterating over the contents of a directory. (I find this shocking. I consider iterating over directories as a workhorse operation in bioinformatics.)

wdl
13 months ago

Not tested, but something like:

version 1.0

workflow BIOSTAR {

    # stage 1: generate the N files
    call FIRST

    # stage 2: run one SECOND job per file produced by FIRST, in parallel
    scatter (F in FIRST.each_F) {
        call SECOND {
            input:
                f = F
        }
    }
}

task FIRST {

    command <<<
        mkdir -p OUT

        # placeholder for the real command that writes its N files into OUT/
        runyourcommand --output OUT

        # list every file produced, one absolute path per line
        find "${PWD}/OUT" -type f > chunks.list
    >>>

    output {
        # each line (a path) becomes a File, so the workflow can scatter over them
        Array[File] each_F = read_lines("chunks.list")
    }
}

task SECOND {
    input {
        File f
    }

    command <<<
        secondcommand ~{f}
    >>>
}
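If it helps, you can sanity-check the syntax of a file like this with miniwdl check or womtool validate before submitting it to your cluster.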
13 months ago
Geraldine ▴ 20

I believe you’re looking for the scatter construct. If you set the output of FIRST to be a list of files, you can simply set up a scatter() block to which you pass your FIRST.output_list (or whatever you call the output variable). The SECOND call will live inside the scatter block and will receive the individual files (one per parallel instantiation); see the sketch below.

This is off the top of my head so there might be a subtlety I’m not getting from your post, but I’m pretty confident that reading the description of scatter() will be helpful.

Lmk if you need more detailed guidance.
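A minimal sketch of that shape (untested; "output_list" just stands in for whatever you actually name FIRST's Array[File] output, and the FIRST and SECOND tasks are assumed to be declared elsewhere, e.g. as in the other answer):

version 1.0

workflow TwoStage {
    # FIRST is assumed to expose Array[File] output_list
    call FIRST

    # the scatter block launches one SECOND call per element, in parallel
    scatter (f in FIRST.output_list) {
        call SECOND {
            input:
                f = f
        }
    }
}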

13 months ago
Ruben • 0

Unfortunately, as far as I can tell, WDL provides no support for iterating over the contents of a directory. (I find this shocking. I consider iterating over directories as a workhorse operation in bioinformatics.)

You can gather the contents of a directory in a list of files using globs.

https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#globs
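For instance (a sketch; the OUT directory, the glob pattern, and your_command are placeholders):

version 1.0

task FIRST {
    command <<<
        mkdir -p OUT
        # placeholder: write the N files into OUT/
        your_command --outdir OUT
    >>>

    output {
        # glob() collects everything matching the pattern into an Array[File]
        Array[File] out_files = glob("OUT/*")
    }
}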

You can then, as described by the other commenters, use scatter to run a task on each of those files.

A simple scatter example is given here: https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#outputs

If you want some real-life examples of WDL, I recommend looking at https://github.com/biowdl (disclaimer: I am one of the authors). In particular, https://github.com/biowdl/tasks might save you some work, and https://github.com/biowdl/germline-DNA and https://github.com/biowdl/RNA-seq are pipelines we use in production.

