Question

OMA in AWS cloud

1

7 months ago

Ksel ▴ 10

Dear colleagues,

I am new to cloud environments and would like to run OMA standalone on AWS. First, I would like to know if it is possible. Second, if it is, which parameters should I use (AWS configuration, especially for the all-versus-all step with more than 100 large genomes, any special OMA commands, ...)?

I didn't find anything in the web.

Many thanks for your feedback

OMA AWS • 980 views

ADD COMMENT • link updated 6 months ago by Adrian Altenhoff ★ 1.1k • written 7 months ago by Ksel ▴ 10

0

if it is possible.

Yes, most things are possible on AWS.

I didn't find anything in the web.

I highly highly doubt that.

ADD REPLY • link 7 months ago by Ram 43k

0

See this post.

ADD REPLY • link 6 months ago by stefan ▴ 10

score 2 · Answer 1 · 2023-09-27

Hi,

We haven't tried running OMA standalone on AWS so far, but we would certainly be curious to hear about your experience. Note that OMA standalone is available as a docker container, so the barrier shouldn't be too high. However, if you want to run the All-vs-All step with several instances in parallel, they must either share a file system, or you need to merge all the files in Cache/AllAll from the individual instances into a common place once the All-vs-All step has finished (no communication between instances is needed during the computation).
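For the case without a shared file system, the merge step could look roughly like the sketch below. It assumes each instance's working directory has been copied to one machine, that the All-vs-All results live under Cache/AllAll in each of them, and that no two instances produced a file with the same relative path. The function name merge_allall and the directory arguments are hypothetical and not part of OMA.

```python
import shutil
from pathlib import Path


def merge_allall(instance_dirs, merged_dir):
    """Copy every file under <instance>/Cache/AllAll into <merged_dir>/Cache/AllAll.

    Assumption: instances never write the same relative path, so an existing
    destination file is treated as an error instead of being overwritten.
    Returns the number of files copied.
    """
    target = Path(merged_dir) / "Cache" / "AllAll"
    target.mkdir(parents=True, exist_ok=True)
    copied = 0
    for instance in instance_dirs:
        source = Path(instance) / "Cache" / "AllAll"
        if not source.is_dir():
            continue  # this instance produced no All-vs-All output
        for path in sorted(source.rglob("*")):
            if not path.is_file():
                continue
            dest = target / path.relative_to(source)
            if dest.exists():
                raise FileExistsError(f"refusing to overwrite {dest}")
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, dest)
            copied += 1
    return copied
```

Refusing to overwrite is a deliberately cautious choice: if two instances did write the same file, it is safer to inspect the conflict by hand than to silently keep one copy.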

Unfortunately, I don't know exactly what is and isn't possible on AWS. The necessary steps would be:

1. Put your dataset on a storage volume.
2. Run oma -c using the docker container dessimozlab/oma_standalone:latest, with the storage mounted on /oma.
3. Run a number of oma containers with the -s option, all mounting the storage on /oma. Note that you need to fix the total number of processes in advance and set two environment variables in each container: NR_PROCESSES=xxx, where xxx is the total number of processes you want to run, and THIS_PROC_NR=yyy, where yyy is the number of this process, from 1 to xxx.
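As a rough illustration of the last step, the sketch below only builds the docker command lines, one per process, and does not run anything. It assumes the storage is available at a local path on each host and that oma can be started through sh -c inside the image, as in the AWS Batch job definition posted further down this thread. The function name oma_allall_commands and the path /mnt/oma-data are hypothetical.

```python
import shlex


def oma_allall_commands(data_dir, nr_processes,
                        image="dessimozlab/oma_standalone:latest"):
    """Return one `docker run` argument list per All-vs-All process.

    Every container mounts data_dir on /oma and receives the same
    NR_PROCESSES plus its own THIS_PROC_NR, numbered from 1.
    """
    if nr_processes < 1:
        raise ValueError("nr_processes must be at least 1")
    commands = []
    for proc_nr in range(1, nr_processes + 1):
        commands.append([
            "docker", "run", "--rm",
            "-v", f"{data_dir}:/oma",
            "-e", f"NR_PROCESSES={nr_processes}",
            "-e", f"THIS_PROC_NR={proc_nr}",
            image,
            "sh", "-c", "oma -s",  # assumption: same invocation style as the Batch example
        ])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed or pasted into a scheduler.
    for argv in oma_allall_commands("/mnt/oma-data", 4):
        print(shlex.join(argv))
```

Each printed line can be started on a different instance, provided all of them see the same storage under the mounted path.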

As said, I don't know how easy it is to specify such things on AWS, but I assume there is a way... ;-)

Best wishes Adrian

score 1 · Answer 2 · 2023-10-26

Hi,

you can use AWS Batch to execute OMA. You can use AWS Batch Array Jobs for parallelization and use Amazon EFS as filesystem. You can also use Amazon FSx for Lustre if you need higher filesystem performance. You can avoid throttling of Docker Hub image pulls by using Amazon Elastic Container Registry (ECR). It is also recommended to build your own up-to-date container image to improve security. The OMA repository contains the required Dockerfile.

I have tested the first stage of the OMA ToyExample with AWS Batch, using the AWS Batch job definition below. The command in the job definition uses the AWS Batch environment variable AWS_BATCH_JOB_ARRAY_INDEX to set THIS_PROC_NR, as recommended by Adrian. You need to set NR_PROCESSES to the AWS Batch array size specified at job launch; I used an array size of 6 and set NR_PROCESSES=6 accordingly.

Best regards,

Stefan

{
  "jobDefinitionName": "oma-test",
  "jobDefinitionArn": "<removed>",
  "revision": 1,
  "status": "ACTIVE",
  "type": "container",
  "parameters": {},
  "retryStrategy": {
    "attempts": 1,
    "evaluateOnExit": []
  },
  "containerProperties": {
    "image": "dessimozlab/oma_standalone:latest",
    "command": [
      "sh",
      "-c",
      "export NR_PROCESSES=6; export THIS_PROC_NR=$((AWS_BATCH_JOB_ARRAY_INDEX+1)); echo Number of processes; echo $NR_PROCESSES; echo This process; echo $THIS_PROC_NR; echo Starting OMA ;oma -s; echo OMA completed; cd /; umount /oma"
    ],
    "volumes": [
      {
        "name": "data-volume",
        "efsVolumeConfiguration": {
          "fileSystemId": "<yourEFSfilesystem>",
          "rootDirectory": "data/dc"
        }
      }
    ],
    "environment": [],
    "mountPoints": [
      {
        "containerPath": "/oma",
        "readOnly": false,
        "sourceVolume": "data-volume"
      }
    ],
    "readonlyRootFilesystem": false,
    "privileged": true,
    "ulimits": [],
    "resourceRequirements": [
      {
        "value": "2",
        "type": "VCPU"
      },
      {
        "value": "2048",
        "type": "MEMORY"
      }
    ],
    "linuxParameters": {
      "devices": [],
      "initProcessEnabled": false,
      "tmpfs": []
    },
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {},
      "secretOptions": []
    },
    "secrets": []
  },
  "timeout": {
    "attemptDurationSeconds": 300
  },
  "tags": {},
  "platformCapabilities": [
    "EC2"
  ],
  "containerOrchestrationType": "ECS"
}
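The index arithmetic in the job definition's command, and the matching job submission, can be sketched as follows. AWS Batch numbers array children from 0, while OMA expects THIS_PROC_NR to start at 1, hence the +1. The second function only builds an aws batch submit-job command line. The job name and queue name used in the example call are hypothetical, and the array size passed here has to match the NR_PROCESSES value hard-coded in the job definition.

```python
import shlex


def this_proc_nr(env):
    """Map the 0-based AWS Batch array index to OMA's 1-based process number."""
    return int(env["AWS_BATCH_JOB_ARRAY_INDEX"]) + 1


def submit_array_job_argv(job_name, job_queue, array_size,
                          job_definition="oma-test"):
    """Build the AWS CLI call that launches the job definition as an array job."""
    # AWS Batch accepts array sizes from 2 to 10000.
    if not 2 <= array_size <= 10000:
        raise ValueError("array_size must be between 2 and 10000")
    return [
        "aws", "batch", "submit-job",
        "--job-name", job_name,
        "--job-queue", job_queue,
        "--job-definition", job_definition,
        "--array-properties", f"size={array_size}",
    ]


if __name__ == "__main__":
    # Hypothetical job and queue names; array size 6 matches NR_PROCESSES=6.
    print(shlex.join(submit_array_job_argv("oma-allall", "my-queue", 6)))
```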

score 0 · Answer 3 · 2023-10-11

Hi,

I haven't personally run OMA on AWS, but I have experience running similar workloads there. AWS offers several services that can accomplish the same goal, so choosing between them can be confusing. For this specific workload, if it is not a recurring task (you want to run it once or a few times, not weekly, and you are not offering it as a service to others), I would suggest using EC2, treating it as an ordinary Linux server, and running OMA standalone on it.

Here is what this workflow would look like:

1. Store your data in S3.
2. Start a compute-optimized EC2 instance with enough CPU cores for the All-vs-All step.
3. Upload the All-vs-All result to S3.
4. Start a memory-optimized EC2 instance with enough RAM for the Orthology Inference step.
5. Upload the final result to S3. Use S3 Glacier if you don't retrieve the data frequently.
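The S3 transfers in this workflow could be scripted along the lines of the sketch below, which only assembles aws s3 sync command lines. The bucket name, the key prefixes (input, allall, results) and the working directory /oma are hypothetical. It also assumes that OMA keeps its All-vs-All results under Cache and its final results under Output inside the working directory.

```python
import shlex


def s3_sync_argv(source, destination, storage_class=None):
    """Build an `aws s3 sync` call, optionally with a target storage class."""
    argv = ["aws", "s3", "sync", source, destination]
    if storage_class is not None:
        argv += ["--storage-class", storage_class]
    return argv


def workflow_transfers(bucket, workdir="/oma"):
    """Return (description, argv) pairs for the transfers in the workflow."""
    base = f"s3://{bucket}"
    return [
        ("fetch input data onto the compute-optimized instance",
         s3_sync_argv(f"{base}/input", workdir)),
        ("upload the All-vs-All result",
         s3_sync_argv(f"{workdir}/Cache", f"{base}/allall/Cache")),
        ("fetch input data onto the memory-optimized instance",
         s3_sync_argv(f"{base}/input", workdir)),
        ("fetch the All-vs-All result onto the memory-optimized instance",
         s3_sync_argv(f"{base}/allall/Cache", f"{workdir}/Cache")),
        ("archive the final result in S3 Glacier",
         s3_sync_argv(f"{workdir}/Output", f"{base}/results/Output", "GLACIER")),
    ]


if __name__ == "__main__":
    for description, argv in workflow_transfers("my-oma-bucket"):
        print(f"# {description}")
        print(shlex.join(argv))
```

Writing the final result with the GLACIER storage class makes it cheaper to keep but slower to retrieve, so it only makes sense once the results are no longer needed day to day.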

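For more information on how to run OMA standalone and how to choose the right amount of CPU and RAM, see the official post.
For more information on S3 Glacier, see my post.
For information on EC2 instance types and costs, see here. You need a C-type (compute-optimized) instance for the CPU-heavy step and an R-type (memory-optimized) instance for the RAM-heavy step; M-type instances are general purpose.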