Using AWS for Bioinformatics
2.5 years ago
dk0319 ▴ 70

Does anyone have experience using Amazon's AWS for their bioinformatics work? I have experience using traditional HPC clusters, where I can install Anaconda/Mamba and then conda install any packages I need for my workflows.

AWS is slightly confusing because, while I have installed mamba on my EC2 instance, I am not sure whether I will be charged for all the packages I install while not actually running any jobs. For instance, would I need to wipe out all my packages after I finish running a job and then reinstall everything the next time I want to run one?

Additionally, while I used to use PBS or Slurm for submitting my jobs, that does not seem to be an option with EC2 instances. Is there an alternative way to submit jobs from an EC2 instance, so that I can submit a script and then submit additional jobs without issue?

Any insight or advice would be appreciated.

unix Cluster • 4.9k views
2.5 years ago

If your instance is running, you will be paying for it, regardless of what you have installed with mamba or whether any jobs are running. If you stop or hibernate it, the compute charges stop, but you will still pay for the EBS volume.
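To make the billing split concrete, here is a rough back-of-the-envelope sketch. The function name and both rates are placeholders I made up for illustration, not real AWS prices; check the pricing pages for your instance type and region. The point it tries to show is that compute is billed for the hours the instance runs, while EBS is billed on the provisioned volume size for the whole month, so installing packages does not add cost unless you have to grow the volume.

```python
def estimate_monthly_cost(hours_running, ebs_gb, hourly_rate, ebs_rate_per_gb_month):
    """Return (compute, storage, total) in dollars for one month."""
    compute = hours_running * hourly_rate      # billed only while the instance runs
    storage = ebs_gb * ebs_rate_per_gb_month   # billed on provisioned size, running or not
    return compute, storage, compute + storage


# Placeholder rates for illustration only, NOT real AWS prices.
HOURLY_RATE = 0.50   # hypothetical $/hour for the instance
EBS_RATE = 0.10      # hypothetical $/GB-month for the volume

# Same 200 GB volume (conda environments included) in both scenarios.
always_on = estimate_monthly_cost(730, 200, HOURLY_RATE, EBS_RATE)
stopped_when_idle = estimate_monthly_cost(40, 200, HOURLY_RATE, EBS_RATE)

print("left running all month:", always_on)
print("stopped when idle:     ", stopped_when_idle)
```

With these made-up numbers the storage line is identical in both cases; only the compute line changes. So there should be no need to wipe and reinstall packages between jobs, as long as you stop the instance when you are done.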

There are frameworks like ParallelCluster, which recreate HPC-style clusters in AWS (including a Slurm scheduler), but assuming you are on a framework like Nextflow, it is usually preferable to use AWS Batch. There is also the Amazon Genomics CLI. AWS has a lot of information about this at https://aws.amazon.com/health/genomics/solutions/. Biostars is more suitable if you have a specific question.
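For a feel of what "submit a script, then submit additional jobs" looks like with AWS Batch, here is a minimal sketch that only builds the request for the Batch SubmitJob call. The helper `build_batch_job`, the queue name `my-queue`, the job definition `my-jobdef`, the script names, and the job ID `example-job-id` are all hypothetical; you would need to create a real job queue and job definition first. The field names (`jobName`, `jobQueue`, `jobDefinition`, `containerOverrides`, `dependsOn`) are the ones the SubmitJob API uses.

```python
def build_batch_job(name, command, queue, definition, depends_on=None):
    """Build keyword arguments for an AWS Batch SubmitJob request."""
    request = {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": definition,
        # Override the container command, roughly like the script line in a qsub/sbatch call.
        "containerOverrides": {"command": list(command)},
    }
    if depends_on:
        # Run only after these job IDs finish, similar to a scheduler dependency.
        request["dependsOn"] = [{"jobId": job_id} for job_id in depends_on]
    return request


# Hypothetical queue, job definition, and scripts, for illustration only.
align = build_batch_job("align-sample1", ["bash", "align.sh", "sample1"],
                        "my-queue", "my-jobdef")
call_variants = build_batch_job("call-sample1", ["bash", "call.sh", "sample1"],
                                "my-queue", "my-jobdef",
                                depends_on=["example-job-id"])

# With boto3 installed and credentials configured, submission would look like:
#   import boto3
#   response = boto3.client("batch").submit_job(**align)
#   job_id = response["jobId"]   # pass this in depends_on for the next job
print(align)
print(call_variants)
```

If you use Nextflow with its AWS Batch executor, it makes these submissions for you, so you would not normally write this by hand.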

2.5 years ago

The ENCODE consortium runs workshops that include hands-on sessions to work with the data and pipelines, including AWS accounts as needed. Larger sessions are also under the "ENCODE Jamboree" name, in case that is useful for getting more info from the consortium. That might provide some flavor on the analysis side of things.

Amazon recently held a talk on "Scaling genomics workloads using HPC on AWS". Part of that talk went into how to integrate existing Nextflow and other pipelines to transition from a local HPC to AWS-based cloud work. It might be worth poking around to find this and other talks. There's another talk on "Automating Genomics Workflows on AWS" that might be relevant, for instance.

Other parts of that particular talk on scaling went into more nuts-and-bolts stuff about using spot instances and cold storage to lower overall costs, as AWS is expensive. In the case of spot instances, you request spare capacity at a discounted price that varies with supply and demand, optionally setting the maximum price you are willing to pay. Once your request is fulfilled, you apply a pre-built image to that instance and then get your pipeline running. If AWS needs the capacity back, or the spot price rises above your maximum, Amazon can "interrupt" your host and you may lose it, so your image and pipeline should be lean, fast, and fault tolerant. If you use a regular on-demand instance, however, then you will pay more, but you won't have to bring it down and reimage on reboot, or worry about it getting interrupted; it stays persistent until you stop or terminate it.
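One simple way to make a pipeline tolerate interruptions is to checkpoint each step, so that a restarted instance skips work that already finished. This is only a sketch of that general idea, not something taken from the talk; `run_with_checkpoints`, the step names, and the `.done` marker convention are made up for illustration. In practice the marker files and outputs would need to live on storage that survives the instance, such as S3 or a persistent volume, whereas the demo below just uses a temporary directory.

```python
import tempfile
from pathlib import Path


def run_with_checkpoints(steps, workdir):
    """Run (name, func) steps in order, skipping any that already finished."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, func in steps:
        marker = workdir / f"{name}.done"
        if marker.exists():
            continue          # finished before an earlier interruption
        func()
        marker.touch()        # written only after the step succeeds
        executed.append(name)
    return executed


def demo():
    """Simulate an interruption after the first step, then a restart."""
    log = []
    steps = [
        ("trim", lambda: log.append("trim")),
        ("align", lambda: log.append("align")),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        first = run_with_checkpoints(steps[:1], tmp)   # instance lost after "trim"
        second = run_with_checkpoints(steps, tmp)      # restart: "trim" is skipped
    return first, second, log


print(demo())
```

Workflow managers such as Nextflow have their own resume mechanisms that do this more robustly, so this is mainly to show why small, restartable steps matter on spot instances.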

2.5 years ago

You can use images (for example, a machine image with your tools preinstalled) so that you don't have to set up your tools every time.

2.5 years ago
cyril-cros ▴ 950

Can you use something like Nextflow executed on AWS with Biocontainers? You get a lot of bioconda tools you can chain, and that should be portable.

13 months ago
Simon ▴ 40

In case you weren't aware, bioinformatics platform technology has moved on quite a bit in recent years, so I thought I would point out an alternative approach for those in the same situation. If you don't want to spend all the time and resources to set things up yourself in AWS, then second-generation bioinformatics platforms like Basepair (and LifeBit, I believe) enable you to simply plug in the EC2 and S3 resources in your own AWS account after you have created it. Basically, the platform then abstracts away all of the devops that you would normally have to do yourself, yet acts as an easy wrapper around industry-standard tools and pipelines for a multitude of NGS data types. You can even deploy and run your own workflows if you'd prefer, and with a per-sample usage model that includes 12 months of storage, it's pretty economical if you consider all of the time you save.


Your posts indicate that you are primarily here to promote your company, which is fine. Rather than posting randomly in threads that are months or years old, you may be better served by creating a new post of the Tool type that highlights services offered by your company. If you throw in a discount of some kind, you may even generate a good number of clicks and inquiries. As it is, there are many ChIP-Seq tutorials on the internet, and posts in these old threads are unlikely to be seen by many people.


Thanks for your guidance Mensur, much appreciated


I don't know the rules here, but I would say it's an absolute ethical requirement to disclose if you have any kind of stake in a tool or company you are promoting.
