Parallelize CWL workflow over multiple instances on AWS
12 months ago
uray10 ▴ 10

Hi everybody,

If I have a workflow which has a step that performs a scatter over n samples, what would be the best way to spawn n EC2 instances so that each instance is responsible for the computation of a single sample?

I realize that for local execution one can specify the --parallel flag to launch n threads and each thread will be responsible for the computation of one sample. However, how do I do this on a per-instance level in AWS?

I was looking into Toil and while I couldn't get much information about their parallelization capabilities, my understanding is that Toil, when used with the autoscaling feature, will distribute the workload of a scatter step over n instances. Is my understanding correct? If not, then what options do I have?

Thanks.

cwl Toil

Hello uray10,

As a reminder, we moved CWL community support from this forum to https://cwl.discourse.group/ which is why I didn't see this sooner.

11 months ago

One way to do this is to set one of the ResourceRequirements in CWL (e.g. coresMin) to be equal to the instance size.

For instance, you can run your workflow using toil-cwl-runner ... --nodeTypes m4.xlarge ... and set coresMin: 4 in your CWL definition, so each scatter job claims a whole m4.xlarge (4 vCPUs) and the autoscaler provisions roughly one instance per job. You can optionally pass --maxNodes to cap the number of instances Toil launches. If your workflow is heterogeneous, with multiple instance types, then limiting by disk instead, by setting outdirMin: (in MiB) to slightly less than --nodeStorage (in GiB) [*], is also very robust.

[*] Note that Toil/Mesos use up some disk on the instance, which is why outdirMin should be slightly less than --nodeStorage.
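As a minimal sketch, a tool description using this approach might look like the following (the tool name, input, and exact numbers are placeholders; outdirMin here assumes something like --nodeStorage 32):

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [my_tool]            # hypothetical tool
requirements:
  ResourceRequirement:
    coresMin: 4                   # matches the 4 vCPUs of an m4.xlarge
    outdirMin: 30000              # MiB; slightly below --nodeStorage (GiB)
inputs:
  sample:
    type: File
    inputBinding:
      position: 1
outputs:
  result:
    type: stdout
stdout: result.txt
```

With coresMin equal to the node's vCPU count, the Toil scheduler can place only one such job per node, which is what forces the one-sample-per-instance layout the question asks for.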


This is a pretty good hack. Be sure to tell the underlying tool that it has access to all the cores via arguments: [ --threads=$(runtime.cores) ] or similar, depending on the command-line options the tool has.
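For illustration, that could be wired into the tool description like this (the tool name and its --threads flag are assumptions; substitute whatever thread/core option your tool actually accepts):

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [my_aligner]                     # hypothetical tool
requirements:
  ResourceRequirement:
    coresMin: 4
arguments: [ --threads=$(runtime.cores) ]     # pass all allocated cores to the tool
inputs: []
outputs: []
```

At runtime, $(runtime.cores) resolves to the number of cores the runner actually allocated to the job, so the tool uses the whole node rather than a single thread.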