Amazon EC2 for bioinformatics, genomics, NGS analysis
8.5 years ago
katie.duryea ▴ 40

Hi,

I am starting a new thread on this topic because I can't find one that is less than a year old. Does anyone have advice on using Amazon EC2 for bioinformatic analysis, specifically of RNA-seq data? There seem to be many instance options these days, and I would appreciate some feedback on how to choose among them and on whether any instances are specifically tailored to this type of analysis.

Thanks in advance for any input!

Katie

RNA-Seq next-gen • 7.9k views

You should specify the scale of the problem that you wish to solve. What type of data do you have, and how much of it?

8.5 years ago
John 13k

Warning, AWS rant approaching: I've been using EC2 for about 6-7 years, and before that I had 4 servers colocated across the UK for some websites that I managed. EC2 provided me with what looked like a really nice set of value-add features, at roughly the same price as my existing hosting costs, which is why I made the switch. For example, installing a new server to cope with Christmas holiday load is something I'll never forget as the most stressful time of my early twenties, because that meant new hardware. Do I lease it or buy it? If I buy it, I have to source compatible parts and make sure it's under the wattage requirements. Then I have to physically get it there, get security clearance, wire it up without unplugging anyone else's stuff, configure the network/firewall, updates, etc. And then maybe the site gets no traffic beyond what we expected. Or maybe it still all chokes up and falls over. Or maybe that new heatsink wasn't fitted right. Or maybe someone else in the rack adds another server and pulls out my ethernet cable. Managing real hardware was a huge pain bundled with a lot of uncertainty. The most consistent thing about hosting was the price: you paid for the physical space and the size of your network pipe.

Then EC2 came along and took all those troubles away for me, and instead I just get billed for what I use, which seemed like a fair deal and massively simplified my life. Sure, the month-on-month hosting costs are higher, but you don't have to buy hardware and EC2 is so flexible for scaling up or down. Also, I can do all my IP/domain name/load balancing using their GUI, which was and still is so much better than anything built into a Cisco router or offered by networksolutions.com. Spreading static content over their CDN is easy-peasy. High availability means customers are never sat staring at an empty page, and I never get calls at 3am. Making secure databases for customer credit card numbers etc. is hassle-free, and I know that if anything goes wrong I can say I did my bit and blame Amazon. They now do SSL certification too, so that's something I'll be moving over soon as well.

So what does any of this have to do with Bioinformatics? Well... nothing. And that's my point really. Most of the services AWS provides have no relevance to how bioinformatic data is processed, but those features jack up the prices because people like myself are willing to pay a little extra on the hosting/storage costs for convenience. If you're not using those services, you think you're not paying for them, but it's factored into your hosting costs for sure. Yes, you can spin up a compute server and "only pay what you use", but factored into that cost is the convenience of being able to have immediate access in the first place. I say immediate. It's as immediate as 5 MB/s to transfer your data to the server from a local source.

But there is another way to have truly immediate access, with no costs for usage or storage and a simpler interface: buy a small compute server with 16 CPUs, 124 GB of RAM, 1 TB of SSD space and 10 TB of HDD space. The cost of all this is roughly what you would pay AWS over about 1.5 years. If you plan on doing bioinformatics for longer than that, you'd be a lot better off, in my opinion, just buying the hardware and sticking it under your desk.
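To make the break-even reasoning concrete, here is a minimal Python sketch. The prices in it are purely hypothetical placeholders (they are not quotes from AWS or any hardware vendor) and were chosen only so that the result matches the roughly 1.5 years mentioned above; plug in your own numbers.

```python
def break_even_months(hardware_cost, cloud_cost_per_month,
                      local_running_cost_per_month=0.0):
    """Months after which owning the server is cheaper than renting.

    All arguments are in the same currency. The local running cost
    (power, hosting, admin time) defaults to zero, which is optimistic.
    """
    monthly_saving = cloud_cost_per_month - local_running_cost_per_month
    if monthly_saving <= 0:
        # Renting is never more expensive per month, so there is no break-even.
        return float("inf")
    return hardware_cost / monthly_saving


# Hypothetical figures, for illustration only.
hardware_cost = 9000.0         # one-off purchase price of the server
cloud_cost_per_month = 500.0   # comparable always-on instance plus storage

months = break_even_months(hardware_cost, cloud_cost_per_month)
print(f"Break-even after {months:.1f} months ({months / 12:.1f} years)")
```

If the machine would sit idle most of the time, the comparable monthly cloud bill drops and the break-even point moves further out, which is the 'spiky usage' exception.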

There are exceptions. If your usage is very 'spiky', like you need 100 cores NOW and none tomorrow, then EC2 will give you that. If you need 50 cores today and tomorrow, well, you'll be paying the same amount as the other guy, and not benefiting from any of that convenience.

In my opinion, the next big breakthrough in bioinformatics will have nothing to do with clouds. It's far more likely to be something to do with clusters of small, cheap CPUs (like the PINE A64), much as RAID was for hard drives, or it will be modern languages like Julia or Rust that work with all the low-level vector instructions the CPUs could use if only the software knew how to use them. Honestly, I have no idea why anyone would pay Amazon for a service that you can provide yourself with at home.


"Honestly, I have no idea why anyone would pay Amazon for a service that you can provide yourself with at home."

The problem with this is that an organization should not run anything of value off of someone's home computer.

There are a lot of risks associated with that. It only works short term, when you are the sole person responsible and no one else (including yourself) needs to access that data from elsewhere.

Home internet is also quite unreliable and many orders of magnitude slower than an Amazon instance (I just measured 800 Mbit/s in and out of Amazon; see what you get at home), not to mention that home internet is throttled and limited in all manner of ways. If you use a lot of data, you may just be shut off, or your costs may spike. The average internet user transfers only 30 GB per month. If you start using substantially more, your internet provider will probably prefer to see you gone and replace you with a cheaper user. They may refuse to provide you service at home-internet prices.
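To put the two bandwidth figures quoted in this thread (5 MB/s from a local source, 800 Mbit/s measured inside Amazon) side by side, here is a rough Python sketch. The 100 GB dataset size is an assumption made only for illustration, and the calculation ignores protocol overhead and throttling.

```python
def mbit_to_mbyte(rate_mbit_per_s):
    """Convert megabits per second to megabytes per second."""
    return rate_mbit_per_s / 8


def transfer_hours(size_gb, rate_mb_per_s):
    """Hours needed to move size_gb gigabytes at rate_mb_per_s (1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_per_s / 3600


dataset_gb = 100  # hypothetical size of one sequencing run

links = {
    "local upload (5 MB/s)": 5,
    "inside Amazon (800 Mbit/s)": mbit_to_mbyte(800),
}
for label, rate in links.items():
    print(f"{label}: {transfer_hours(dataset_gb, rate):.2f} h")
```

Under these assumptions the upload from a local source takes hours while the same transfer inside Amazon takes minutes, so where the raw data lives matters as much as the instance type.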

Now I buy my own hardware but the University pays for the internet, hardware hosting and security costs - those probably add up to be higher than the cost of the hardware.


"So what does any of this have to do with Bioinformatics? Well... nothing. And that's my point really. Most of the services AWS provides have no relevance to how bioinformatic data is processed, but those features jack up the prices because people like myself are willing to pay a little extra on the hosting/storage costs for convenience."

This is an old thread, but I'm curious about this statement: which of the services AWS provides do you see as unnecessary for bioinformatics analysis? Thanks!


The vast majority of what AWS offers is unrelated to bioinformatic analysis. I give a lot of examples in the first two paragraphs of the above, like DNS, SSL, etc. These are features that businesses and SMCs will pay a premium for, because setting that stuff up on your own is a real hassle. I was also trying to explain how "peace of mind" is a service AWS provides that is unnecessary for bioinformatic analysis, but adds significant costs because people in business will pay a lot for it. But in order to see how peace of mind costs money, you have to appreciate that AWS prices its services at what the market will bear, and not, as I think some people assume, as low as humanly possible.

AWS now does grants for publicly funded research (https://aws.amazon.com/grants/), so there is finally a two-tier system for non-profit and for-profit users. As a result, almost everything I previously said no longer applies if you get a grant :-)

...I'd still get an under-the-desk compute server in 2017 though, just so you have somewhere to call /home.

8.5 years ago

This all depends on your problem, I guess. If you're performing RNA-seq analysis, then decide what pipeline you're going to follow and build from there. I'd start with a generic EC2 Ubuntu instance, install the tools, then save that image to use for your analysis.

It depends how generic you want this to be, really. For the sake of filling in the gaps, say you have a single RNA-seq experiment to carry out: 30 million reads, paired-end, two conditions in triplicate, with the aim of a differential expression analysis. I'd start by creating an instance that has Salmon, Kallisto, STAR and HTSeq-count installed. For each sample, I'd align with STAR, quantify with Salmon and Kallisto, and run HTSeq-count on the alignments, with the end goal being gene-level counts from HTSeq-count, Salmon counts and Kallisto counts, per sample. From there I'd work with that data in an R session to test the hypothesis. This is very much a toy example, but my point is that it all depends on how you want to analyse your data.
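The per-sample steps above can be sketched as a small Python helper that only assembles the command lines, without running anything. This is an illustration under stated assumptions, not a tested pipeline: every path, the sample name, the thread count and the unstranded setting (`-s no`) are hypothetical, and the indices are assumed to have been built beforehand.

```python
import shlex

# Hypothetical index and annotation paths; adjust to wherever you built them.
STAR_INDEX = "ref/star_index"
SALMON_INDEX = "ref/salmon_index"
KALLISTO_INDEX = "ref/kallisto.idx"
GTF = "ref/genes.gtf"


def build_commands(sample, r1, r2, threads=8):
    """Return the per-sample command lines as argument lists (nothing is run)."""
    star_prefix = f"star/{sample}_"
    # STAR names its coordinate-sorted BAM after the output prefix.
    bam = f"{star_prefix}Aligned.sortedByCoord.out.bam"
    return {
        "star": ["STAR", "--runThreadN", str(threads),
                 "--genomeDir", STAR_INDEX,
                 "--readFilesIn", r1, r2,
                 "--readFilesCommand", "zcat",
                 "--outSAMtype", "BAM", "SortedByCoordinate",
                 "--outFileNamePrefix", star_prefix],
        "salmon": ["salmon", "quant", "-i", SALMON_INDEX, "-l", "A",
                   "-1", r1, "-2", r2,
                   "-p", str(threads), "-o", f"salmon/{sample}"],
        "kallisto": ["kallisto", "quant", "-i", KALLISTO_INDEX,
                     "-o", f"kallisto/{sample}", "-t", str(threads), r1, r2],
        # htseq-count prints its counts to stdout; redirect to a file when running.
        # "-s no" assumes an unstranded library, which may not match your data.
        "htseq": ["htseq-count", "-f", "bam", "-r", "pos", "-s", "no", bam, GTF],
    }


# Hypothetical sample name and FASTQ files.
commands = build_commands("ctrl_1", "ctrl_1_R1.fastq.gz", "ctrl_1_R2.fastq.gz")
for tool, cmd in commands.items():
    print(f"{tool}: {shlex.join(cmd)}")
```

Keeping the commands as data makes it easy to print them for review first and then hand them to `subprocess.run` or a job scheduler, one sample at a time.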


Andrew, I'm curious why you recommend running three different quantification methods/tools. Is it to see if the results are consonant, or for some other reason?


Fair point. Kallisto and Salmon + Sleuth are the tools we're testing out for differential transcript expression at the moment. I say both because they do have differences in their methodologies, so getting a consensus is generally reassuring. I'd also take the transcript-level counts and feed them into DESeq2, to compare that with the straightforward and reliable methodology of HTSeq-count gene-level counts. It's all personal preference, but generally I like to have a cross-tool consensus to add to my confidence in what I'm seeing.
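One simple way to look at a cross-tool consensus is to keep only the genes that several workflows call as differentially expressed. The Python sketch below assumes you have already exported one list of significant gene IDs per workflow; the workflow labels and gene IDs are made up for illustration, and the `min_tools` threshold is a choice, not a recommendation.

```python
from collections import Counter


def consensus_genes(calls, min_tools=2):
    """Return the genes reported as significant by at least min_tools workflows.

    calls maps a workflow label to an iterable of gene IDs.
    """
    # Count each gene once per workflow, even if it is listed twice.
    counts = Counter(gene for genes in calls.values() for gene in set(genes))
    return {gene for gene, n in counts.items() if n >= min_tools}


# Made-up results, for illustration only.
calls = {
    "kallisto_sleuth": ["geneA", "geneB", "geneC"],
    "salmon_deseq2": ["geneA", "geneC", "geneD"],
    "htseq_deseq2": ["geneA", "geneD", "geneE"],
}

print(sorted(consensus_genes(calls, min_tools=3)))  # called by all three
print(sorted(consensus_genes(calls, min_tools=2)))  # called by at least two
```

Genes that only one workflow reports are not necessarily wrong, but they are the ones worth inspecting by hand before drawing conclusions.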


Excellent answer. I'd like to add that all your installation work should take place on an Amazon EC2 Free Tier instance. You can then create an image and move it over to your actual work instance. This will save you money when you're just starting out.

8.5 years ago
William ★ 5.3k

You could also provision a bcbio cloud / cluster on top of Amazon.

This has the advantage that all necessary (RNA-seq) tools are installed and that you can run common RNA-seq pipelines directly.

http://bcbio-nextgen.readthedocs.org/en/latest/contents/cloud.html

http://bcbio-nextgen.readthedocs.org/en/latest/contents/pipelines.html#rna-seq

I haven't used this, but it's what I would consider if I had to run an (RNA-)seq analysis that's too large to run on local compute.

8.5 years ago

I mainly use EC2 for jobs where I don't want to use my local HPC for whatever reason. I set up an ad hoc cluster with StarCluster (http://star.mit.edu/cluster/) and run my jobs either using the pre-installed Sun Grid Engine or using GNU Parallel to run across nodes.

I generally use it for:

1) Large eukaryotic genome assembly jobs where distributed memory is needed.

2) Tasks where it might take multiple days, but I want it done in a few hours.
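For the second case, a back-of-the-envelope Python sketch can estimate how many nodes are needed to squeeze a multi-day batch into a few hours. It assumes the jobs are independent and take about the same time each, which will not hold for distributed-memory assembly; the job count, runtime and slots per node below are hypothetical.

```python
import math


def nodes_needed(n_jobs, hours_per_job, target_hours, jobs_per_node=1):
    """Nodes required to finish n_jobs independent jobs within target_hours."""
    # How many jobs one slot can run back to back before the deadline.
    rounds = math.floor(target_hours / hours_per_job)
    if rounds < 1:
        raise ValueError("a single job already takes longer than the target")
    slots = math.ceil(n_jobs / rounds)
    return math.ceil(slots / jobs_per_node)


# Hypothetical batch: 480 one-hour jobs, 8 concurrent jobs per node.
n_jobs, hours_per_job, jobs_per_node = 480, 1.0, 8

single_node_hours = n_jobs * hours_per_job / jobs_per_node
nodes = nodes_needed(n_jobs, hours_per_job, target_hours=6.0,
                     jobs_per_node=jobs_per_node)
print(f"One node: {single_node_hours:.0f} h; {nodes} nodes finish in 6 h")
```

In this idealised case the total node-hours are the same either way, so the extra nodes buy turnaround time rather than a lower bill.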

Even using spot instances, the costs can add up. So buyer beware.

Protip: cc2.8xlarge instances (which don't exist in newer regions) are usually under-utilized because they are older machines.
