Question: Parallelize CWL workflow over multiple instances on AWS
6 months ago, uray10 wrote:

Hi everybody,

If I have a workflow which has a step that performs a scatter over n samples, what would be the best way to spawn n EC2 instances so that each instance is responsible for the computation of a single sample?

I realize that for local execution one can specify the --parallel flag to launch n threads, each responsible for the computation of one sample. However, how do I do this at a per-instance level on AWS?

I was looking into Toil, and while I couldn't find much information about its parallelization capabilities, my understanding is that Toil, when used with the autoscaling feature, will distribute the workload of a scatter step over n instances. Is my understanding correct? If not, what options do I have?

Thanks.

Tags: cwl, toil

5 months ago, asha.rostamianfar wrote:

One way to do this is to set one of the ResourceRequirement fields in CWL (e.g. coresMin) equal to the corresponding capacity of the instance type.

For instance, you can run your workflow using: toil-cwl-runner ... --nodeTypes m4.xlarge ... and set coresMin: 4 in your CWL definition. An m4.xlarge has 4 vCPUs, so each job fills a whole instance and no two scatter jobs can share one. You may optionally pass --maxNodes to cap the number of instances Toil launches.

If your workflow is heterogeneous and uses multiple instance types, limiting by disk instead is also very robust: set outdirMin (in MiB) to slightly less than --nodeStorage (in GiB) [*].

[*] Note that Toil/Mesos use up some disk on the instance, which is why outdirMin should be slightly less than --nodeStorage.
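
To make the coresMin approach concrete, here is a minimal sketch of a scatter workflow whose per-sample step requests all 4 cores of an m4.xlarge. The step name per_sample, the input name samples, the wc -l command, and the output file name count.txt are placeholders invented for illustration; only coresMin: 4 and the outdirMin idea come from the answer above. The commented-out outdirMin value assumes a hypothetical --nodeStorage of 50 (GiB), which is 51200 MiB.

```yaml
cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}   # needed for the scatter below

inputs:
  samples: File[]                 # hypothetical input: one file per sample

outputs:
  counts:
    type: File[]
    outputSource: per_sample/result

steps:
  per_sample:                     # hypothetical step name
    scatter: sample               # one job per element of "samples"
    in:
      sample: samples
    out: [result]
    run:
      class: CommandLineTool
      baseCommand: [wc, -l]       # placeholder for the real per-sample tool
      requirements:
        ResourceRequirement:
          coresMin: 4             # all 4 vCPUs of an m4.xlarge, so one job per node
          # Disk-based alternative for mixed instance types (value in MiB).
          # Assumed example: slightly under a 50 GiB --nodeStorage (51200 MiB).
          # outdirMin: 45000
      inputs:
        sample:
          type: File
          inputBinding:
            position: 1
      outputs:
        result:
          type: stdout
      stdout: count.txt
```

The ResourceRequirement sits on the scattered tool rather than on the workflow, so every scatter job carries the full-instance request individually.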
