Question

Forum:Pain points using commercial clouds

6

22 months ago

Sasha ▴ 850

Folks that run on Google Cloud or AWS - what are your biggest pain points?

My understanding is that the following are hard to keep track of:

The types of pipelines that you've run on a sample
What modifications were performed on a sample after you got it back from the sequencer
Getting a holistic picture of your pipelines and samples as an experiment unit
Why a pipeline failed

Additionally - do you run your jobs on AWS/Google batch or on VMs?

Let me know in the DMs/Twitter/LinkedIn or in the comments! Looking for some data points for a future product.

gpt tinybio • 3.0k views

ADD COMMENT • link updated 22 months ago by vincenthus ▴ 70 • written 22 months ago by Sasha ▴ 850

2

Next time, it would be better to include your social links if you want people to contact you :)

For me, the most difficult things are environment setup and running into bugs in poorly maintained tools.

ADD REPLY • link 22 months ago by vincenthus ▴ 70

0

I thought it would be good if Sasha could also interact directly with this thread, now that the discussion has started.

ADD REPLY • link 22 months ago by Michael 56k

score 4 · Answer 1 · 2023-09-12

4

22 months ago

Michael 56k

I believe the biggest show-stopper for using commercial cloud systems, at least for academics, is how to pay for it. Often there is no funding allocation for external cloud computation and storage, and not many funding agencies require researchers to explicitly apply for compute and storage quota funding beforehand. For some bioinformatics tasks, we need small or medium-sized HPC instances (128 GB to 1 TB RAM, 20-80 vCPUs). How would we pay for this?

Instead, we are provided with local servers or medium to large HPC infrastructures, such as NREC (NREC may indeed offload some computation to AWS at times) in Norway or LUMI for Europe. These are essentially either free or very low cost for academics in comparison to commercial providers.

This is of course different for businesses, which can sometimes also access these infrastructures, but at a totally different rate.

In conclusion: there is a large portion of academics in the bioinformatics field for whom the use of AWS and other commercial providers for research is totally unattractive.

ADD COMMENT • link 22 months ago by Michael 56k

2

This used to be my pain point in academia, too. I'd add that even if you have the funds from a grant to run things on AWS, chances are you won't have that funding in 2 or 3 years. Who will then pay for the data on S3?

ADD REPLY • link 22 months ago by Philipp Bayer 8.8k

0

But isn't S3 a very small cost compared to compute? I can imagine you have very large datasets, of course. I am coming from the software engineering field, so correct me if I am wrong.

It is $0.02 per GB per month, or $20 per TB per month, but I believe if you gzip it, it can be much cheaper ($2 per TB?).

Then renting a decent EC2 instance can cost $2 per hour for 40 CPUs and 160 GB of RAM.
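
As a back-of-the-envelope check, the sketch below plugs in only the figures quoted in this reply ($0.02 per GB per month for S3, $2 per hour for the instance). These are the numbers from this thread, not current AWS list prices, and the decimal-terabyte convention and the project sizes in the usage note are illustrative assumptions.

```python
# Back-of-the-envelope comparison using the figures quoted in this thread.
# These are NOT current AWS list prices; treat every constant as an assumption.

S3_USD_PER_GB_MONTH = 0.02   # storage price quoted above
EC2_USD_PER_HOUR = 2.0       # quoted price for a 40-CPU / 160 GB instance
GB_PER_TB = 1000             # assumption: decimal terabytes


def storage_cost(terabytes: float, months: int) -> float:
    """Cost of keeping `terabytes` of data in S3 for `months` months."""
    return terabytes * GB_PER_TB * S3_USD_PER_GB_MONTH * months


def compute_cost(hours: float) -> float:
    """Cost of renting the instance for `hours` hours."""
    return hours * EC2_USD_PER_HOUR


if __name__ == "__main__":
    # Hypothetical project: 50 TB kept for 3 years, 200 hours of compute.
    print(f"storage: ${storage_cost(50, 36):,.0f}")
    print(f"compute: ${compute_cost(200):,.0f}")
```

With these made-up inputs, storage comes to about $36,000 and compute to about $400. Which side dominates depends on how much data is kept and for how long, so it is worth rerunning the arithmetic with your own numbers.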

ADD REPLY • link 22 months ago by vincenthus ▴ 70

1

Besides how to pay for it, there is the lack of transparency in pricing. It's very hard to figure out how much running something is going to cost, even before trying to compare providers. You need to be quite familiar with the provider's way of doing things to get the most out of it, so you become specialized and locked in. As mentioned, many academic groups in Europe have access to mostly free compute infrastructure, either at a national/regional infrastructure, at their institution, or through collaborators. I think this is different in the US, where I get the impression that academic compute has essentially been delegated to the private sector.

ADD REPLY • link 22 months ago by Jean-Karim Heriche 27k

score 2 · Answer 2 · 2023-09-13

The reproducibility nightmare

Your bullet points are all valid, but they apply to all computing environments. Even a cloud computer is still just a virtual computer that presents itself to you as a simple Bash shell in 99% of cases. Even with the best intentions to implement a standardized workflow, people will likely start experimenting by trial and error along the way, using different tools, reference genomes, and ad hoc format conversion with the awk command they found on Biostars. QC will be unsatisfying for some samples, so they are removed, possibly followed by rerunning everything from scratch. Possibly the result of some tool was not consistent, so we replaced it with another one or upgraded the version in conda. Except, of course, for that one tool that was not available in conda, so we installed it from source. But later on, we also installed the tool with conda (because it turned out it was available anyway) without specifying the version.

Now we have the output of both versions lying around, but which was which, and is the version of the tool that is installed now identical to what we used earlier?

Of course, we do not call our commands specifying the full path to the executable, because normally we will assume we have only a single program of that name in our PATH ever. So it turns out we are not exactly sure which program and version we used for this analysis.

Hopefully, the tool has written its provenance information into the log or the output file...
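
If the tool does not record this itself, a small wrapper can. The following is a minimal sketch, assuming the tool prints a version string when called with `--version` (many do, not all). The function names and the `provenance.jsonl` filename are hypothetical.

```python
import datetime
import json
import shutil
import subprocess


def tool_provenance(name: str, version_flag: str = "--version") -> dict:
    """Resolve `name` on PATH and capture what it reports as its version.

    Assumption: the tool prints a version string when called with
    `version_flag`; many tools do, but not all.
    """
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name!r} not found on PATH")
    result = subprocess.run(
        [path, version_flag], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr, so keep whichever is non-empty.
    reported = (result.stdout or result.stderr).strip()
    return {
        "tool": name,
        "resolved_path": path,
        "reported_version": reported,
        "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }


def append_record(record: dict, log_path: str = "provenance.jsonl") -> None:
    """Append one JSON record per line to a log file (hypothetical filename)."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

Calling something like `append_record(tool_provenance("samtools"))` right before each run would at least pin down which executable on the PATH was used and what version it claimed to be.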

Now, the only tool Bash (your "workflow" interface) provides to alleviate this is .bash_history, but it is going to be a mess. Also, we have a habit of running everything on wildcards (do_stuff.sh *.sam), but you had to delete some of the files in the meantime due to space problems. So in the end you have no clue what you actually ran your stuff on either.
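
One way around the wildcard problem is to expand the pattern once and write down what it matched. This is a minimal sketch under the assumption that a JSON manifest with SHA-256 checksums is acceptable. The function names and the manifest layout are hypothetical.

```python
import glob
import hashlib
import json


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(pattern: str) -> list[dict]:
    """Expand `pattern` now and record exactly which files it matched."""
    return [
        {"path": path, "sha256": sha256_of(path)}
        for path in sorted(glob.glob(pattern))
    ]


def write_manifest(pattern: str, manifest_path: str) -> list[dict]:
    """Save the expanded file list next to the results (hypothetical layout)."""
    manifest = build_manifest(pattern)
    with open(manifest_path, "w", encoding="utf-8") as handle:
        json.dump({"pattern": pattern, "files": manifest}, handle, indent=2)
    return manifest
```

Running `write_manifest("*.sam", "inputs.json")` before the analysis keeps a record of the inputs even if some files are deleted later. Hashing large alignment files takes time, so recording only path and size may be a reasonable compromise.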

There are also these nasty tools that allow you or even require you to specify vital parameters interactively. You desperately wanted to avoid this, but fatigue set in and you didn't feel like sorting out how to specify these minor parameters on the command line, if that was even possible. Of course, there is no record of your interactions in .bash_history, and therefore you are now completely clueless about which parameters you ran your analysis with.

Finally, you notice that as a result of or despite your efforts, you do not know which program or what version of it you ran on what data using a set of parameters you can only vaguely remember. So you wish you had written the whole analysis in Snakemake from the get-go, but there was so much experimentation and tuning involved (it is scientific data after all).

So, if you want to solve something with AI here, you could implement a tool that "understands" what you are doing, ideally also understands the tools and their output, hooks into Bash and turns your history into a correct Snakemake workflow. Also, when you want to do funny things like installing "the latest" version of a tool without specifying a version, it will warn you with a pop-up window to prevent such stupidity.
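
The warning part is the easiest piece to sketch, and it needs no AI. The following is a rough illustration only: it does not hook into Bash, the function name is hypothetical, and the list of `conda install` options that take a value is deliberately short and not exhaustive.

```python
import shlex

# Options of `conda install` that consume the following argument.
# Assumption: this short list covers the common cases; it is not exhaustive.
OPTIONS_WITH_VALUE = {"-c", "--channel", "-n", "--name", "-p", "--prefix"}


def unpinned_packages(command: str) -> list[str]:
    """Return package specs in a `conda install` command that lack a version."""
    tokens = shlex.split(command)
    if len(tokens) < 2 or tokens[0] not in {"conda", "mamba"} or tokens[1] != "install":
        return []
    unpinned = []
    skip_next = False
    for token in tokens[2:]:
        if skip_next:
            skip_next = False
            continue
        if token in OPTIONS_WITH_VALUE:
            skip_next = True
            continue
        if token.startswith("-"):
            continue  # flag without a separate value, e.g. -y
        # A spec such as "samtools=1.17" or "samtools>=1.17" carries a version.
        if not any(op in token for op in ("=", "<", ">")):
            unpinned.append(token)
    return unpinned
```

For a command such as `conda install -c bioconda samtools=1.17 bwa`, this flags `bwa` as unpinned. A real tool would still have to intercept the command before it runs, which is not shown here.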

Remember me if you get rich and famous with the idea.