What is the general structure of a WDL that implements "embarrassing parallelization"?
13 months ago
kynnjo ▴ 70

(Experienced WDL programmers may find the question below too simple to be real. If so, let me stress that I am only just getting started with WDL, and I really don't know how one would tackle a problem like the one described below with it.)


Just as an example, suppose I have a Python script /path/to/foo.py that takes, say, one command-line argument, and, upon successful completion, writes out a file to disk(1).

I want to invoke this script N times. Since I have M cores at my disposal (where N is much larger than M), I would like to run approximately N/M of these invocations on each core.

This general problem structure is what I am here calling "embarrassingly parallel".

If I were to accomplish this task on, say, an LSF cluster, I would create M one-time shell scripts, each containing, one after the other, the command lines for approximately N/M of the desired invocations of /path/to/foo.py (each with its corresponding argument), and would then submit these M one-time scripts as separate (and hopefully concurrently running) LSF jobs. Each of these M jobs would then run its share of roughly N/M invocations of /path/to/foo.py sequentially, while, ideally, the M jobs themselves would run in parallel with each other.

My question is: how does one implement the same workflow using a WDL (to be fed to Cromwell) instead of an LSF cluster?

More specifically, what is the general structure of a WDL that implements such embarrassing parallelization?


(1) In case it matters, we may assume that the name of this file depends on the argument, so that runs of the script with different arguments will result in differently-named files.

wdl cromwell
13 months ago

WDL is a language supported by a few runners (Cromwell and miniWDL among them). In a normal cluster setup they simply submit all the parallel jobs to a queue (that is what HPC cluster schedulers like LSF are for), and it is then up to the scheduler to run them in parallel on one or more nodes.

The important thing is to inform the scheduler of the number of cores the process requires (if more than one) and of its memory requirements. That is all done in the task's runtime section:

task jes_task {
  command {
    echo "Hello JES!"
  }
  runtime {
    docker: "ubuntu:latest"
    memory: "4G"
    cpu: "3"
  }
}
workflow jes_workflow {
  call jes_task
}

https://cromwell.readthedocs.io/en/stable/tutorials/HPCIntro/

If you are asking how Cromwell knows which tasks can be submitted for parallel execution rather than serially: that is inferred from the "directed acyclic graph" (dependency graph) that it builds from the workflow you describe in WDL.
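
For the pattern in the question (N independent invocations of the same script), the WDL construct that expresses it is scatter: each iteration of the scatter becomes its own job that the backend can run concurrently. Below is a minimal sketch in WDL 1.0 syntax; the workflow and task names, the argument list, the result_<arg>.txt output naming, and the resource values are placeholders for illustration, not anything from the original post.

version 1.0

task foo {
  input {
    String arg
  }
  command <<<
    python /path/to/foo.py ~{arg}
  >>>
  output {
    # Placeholder: assumes foo.py writes a file whose name is derived from its argument.
    File result = "result_~{arg}.txt"
  }
  runtime {
    cpu: 1
    memory: "1G"
  }
}

workflow run_foo {
  input {
    Array[String] args   # the N arguments, one per desired invocation
  }
  scatter (arg in args) {
    call foo { input: arg = arg }
  }
  output {
    Array[File] results = foo.result
  }
}

On an HPC backend, each scatter shard becomes its own submission to the queue, so the scheduler (not the WDL) decides how the N invocations are spread over the M available cores.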


Thank you! What I'm not yet clear on is the following: if I need to run, say, ten billion small, independent tasks that can in principle all run in parallel, will Cromwell launch ten billion independent jobs (somehow), one per task, or will it split the ten billion tasks into, say, 100 subsets and process those subsets as 100 parallel jobs?

What I'm trying to get at here is that there is overhead in spinning up a core to carry out some work, and it would therefore be wasteful to do so for a single short(1) task. It would make more sense to use such a core to process a whole batch (say, 100 million) of those small tasks serially. I wonder whether engines like Cromwell already implement such batching strategies on their own, so that one only needs to tell them what has to be done at the most granular level.


(1) In fact, the overhead of spinning a core up and down is the yardstick I have in mind when I say that a task is "short."


I have no knowledge of WDL and its executors, but 10 billion tasks sounds to me like a problem that you should indeed approach differently, in several ways:

Suppose I have a Python script /path/to/foo.py that takes, say, one command-line argument, and, upon successful completion, writes out a file to disk

It will be next to impossible to write 10 billion files to disk: ext4 and NTFS have a limit of 2^32 - 1 files (4,294,967,295) and FAT32 just 2^16 - 1 (65,535). You could use an object store, but it would be better to write the records into a single file (Apache Parquet?) or a database.

For processing this many records separately, I fear no workflow executor is suited, neither WDL nor Nextflow nor Snakemake. They are meant to parallelize dozens to hundreds of CPU-heavy tasks, e.g. on a per-sample basis, not billions of single records like individual reads. For that, you should use either columnar data-frame tools such as Pandas or Polars, or a stream-processing tool such as Apache Flink.


Actually, in reply to my own comment, I now realize that the question it poses is silly. It is not only that, as Matthias Zepper pointed out, the numbers I used as an example are a bit too ridiculous; it is also that Cromwell has no way of knowing that the tasks I'm describing as "small" are indeed small, even if one can make that term reasonably objective (to human readers) by using the overhead of spinning a node up and down as a yardstick. In other words, it is simply silly to expect Cromwell to determine a minimum batch size for which spinning a core up and down would be worthwhile. Therefore, there is no reasonable way for Cromwell to automatically implement the batching strategy I described in my original post.
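
That said, the batching strategy from the original post can be written out by hand in the WDL itself: prepare one argument-list file per batch ahead of time, scatter over those files, and let each shard work through its batch serially. Here is a minimal sketch in WDL 1.0, in which the batch files, the result_*.txt naming pattern, and the resource values are all placeholders for illustration.

version 1.0

task run_batch {
  input {
    File batch   # a text file listing this batch's arguments, one per line
  }
  command <<<
    # Work through this batch's invocations serially, like one of the M
    # one-time LSF scripts described in the original post.
    while read -r arg; do
      python /path/to/foo.py "$arg"
    done < ~{batch}
  >>>
  output {
    # Placeholder naming pattern; adjust to whatever foo.py actually writes.
    Array[File] results = glob("result_*.txt")
  }
  runtime {
    cpu: 1
    memory: "1G"
  }
}

workflow run_foo_batched {
  input {
    Array[File] batches   # prepared ahead of time, e.g. with `split`
  }
  scatter (batch in batches) {
    call run_batch { input: batch = batch }
  }
  output {
    Array[Array[File]] results = run_batch.results
  }
}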
