Forum: Pain points using commercial clouds
2
6
9 months ago
Sasha ▴ 850

Folks that run on Google Cloud or AWS - what are your biggest pain points?

My understanding is that the following are hard to keep track of:

  • The types of pipelines that you've run on a sample
  • What modifications were performed on a sample after you got it back from the sequencer
  • Getting a holistic picture of your pipelines and samples as an experimental unit
  • Why a pipeline failed

Additionally, do you run your jobs on AWS/Google Batch or on VMs?

Let me know in the DMs/Twitter/LinkedIn or in the comments! Looking for some data points for a future product.

gpt tinybio • 1.6k views
ADD COMMENT
2

Next time, it would be better to include social links if you want people to contact you :)

For me, the most difficult things are environment setup and running into bugs in poorly maintained tools.

ADD REPLY
0

I thought it would be good if Sasha could also interact directly with this thread, now that the discussion has started.

ADD REPLY
4
9 months ago
Michael 54k

I believe the biggest show-stopper for using commercial cloud systems, at least for academics, is how to pay for them. Oftentimes there is no funding allocation for external cloud computation and storage, and not many funding agencies require that researchers explicitly apply for compute and storage quota funding beforehand. For some bioinformatics tasks, we need small or medium-sized HPC instances (128 GB-1 TB RAM, 20-80 vCPUs). How would we pay for this?

Instead, we are provided with local servers or medium to large HPC infrastructures, such as NREC (NREC may indeed offload some computation to AWS at times) in Norway or LUMI for Europe. These are essentially either free or very low cost for academics in comparison to commercial providers.

This is of course different for businesses, which can sometimes also access these infrastructures, but at a totally different rate.

In conclusion: there is a large portion of academics in the bioinformatics field for whom the use of AWS and other commercial providers for research is totally unattractive.

ADD COMMENT
2

This used to be my pain point in academia, too. I'd add that even if you have the funds from a grant to run things on AWS, chances are you won't have that funding in 2 or 3 years. Who will then pay for the data on S3?

ADD REPLY
0

But isn't S3 a very small cost compared to compute? Of course, I can imagine that you have very large datasets. I am coming from the software engineering field, so correct me if I am wrong.

It is $0.02 per GB per month, or $20 per TB, but I believe if you gzip it, it can be much cheaper ($2 per TB?).

Then renting a decent EC2 can cost $2 per hour for 40 CPUs and 160 GB.
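To put those two numbers side by side, here is a rough back-of-the-envelope sketch. The prices are simply the figures quoted in this reply, not current list prices, and the 10 TB / 3 years / one week scenario is a made-up example; real bills also depend on region, storage tier, and egress.

```python
# Prices as quoted in the post above (assumptions, not current list prices).
S3_USD_PER_GB_MONTH = 0.02
EC2_USD_PER_HOUR = 2.0  # the ~40 CPU / 160 GB instance mentioned above


def s3_cost(terabytes: float, months: int) -> float:
    """Storage cost in USD, counting 1 TB as 1000 GB."""
    return terabytes * 1000 * S3_USD_PER_GB_MONTH * months


def ec2_cost(hours: float) -> float:
    """Compute cost in USD for one instance running for `hours`."""
    return hours * EC2_USD_PER_HOUR


# Hypothetical scenario: keep 10 TB for 3 years vs. one week of compute.
storage_3y = s3_cost(terabytes=10, months=36)
compute_week = ec2_cost(hours=24 * 7)
print(f"10 TB on S3 for 36 months: ${storage_3y:,.0f}")
print(f"One instance for one week: ${compute_week:,.0f}")
```

Under these assumed numbers, storage is cheap per month but keeps accruing after the grant ends, which is the concern raised in the parent comment.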

ADD REPLY
1

Besides how to pay for it, there is the lack of transparency in pricing. It's very hard to figure out how much running something is going to cost, let alone compare between providers. You need to be quite familiar with the provider's way of doing things to get the most out of it, so you become specialized and locked in. As mentioned, many academic groups in Europe have access to mostly free compute infrastructure, either at a national/regional infrastructure, at their institution, or through collaborators. I think this is different in the US, where I get the impression that academic compute has essentially been delegated to the private sector.

ADD REPLY
2
9 months ago
Michael 54k

The reproducibility nightmare

Regarding your bullet points, these are all valid points, but they apply to all computing environments. Even a cloud computer is still just a virtual computer that presents itself to you as a simple Bash shell in 99% of cases. Even with the best intentions to implement a standardized workflow, people will likely start experimenting by trial and error along the way, using different tools, reference genomes, and ad hoc format conversion with the awk command they found on Biostars. QC will be unsatisfying for some samples, so they are removed, possibly requiring everything to be rerun from scratch. Possibly the result of some tool was not consistent, so we replaced it with another one or upgraded the version in Conda. Except, of course, for that one single tool that was not available in Conda, so we installed it from source. But later on, we also installed the tool with Conda (because it turned out it was available anyway) without specifying the version.

Now we have the output of both versions lying around, but which was which, and is the version of the tool that is installed now identical to what we used earlier?

Of course, we do not call our commands by specifying the full path to the executable, because normally we assume that we only ever have a single program of that name in our PATH. So it turns out we are not exactly sure which program and version we used for this analysis.

Hopefully, the tool has written its provenance information into the log or the output file...
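One cheap habit that helps with the "which program and version" question is to record the resolved path and reported version of each tool before running it. A minimal sketch, assuming the tool prints its version when called with `--version` (many do, but not all; `describe_tool` and the flag default are my own choices, not part of any existing package):

```python
import shutil
import subprocess


def describe_tool(name: str, version_flag: str = "--version") -> dict:
    """Return the resolved path and first line of version output for a tool."""
    path = shutil.which(name)  # the executable PATH resolves to right now
    if path is None:
        return {"tool": name, "path": None, "version": None}
    try:
        result = subprocess.run(
            [path, version_flag], capture_output=True, text=True, timeout=30
        )
        # Some tools print their version to stderr instead of stdout.
        output = (result.stdout or result.stderr).strip()
        version = output.splitlines()[0] if output else None
    except (OSError, subprocess.TimeoutExpired):
        version = None
    return {"tool": name, "path": path, "version": version}


if __name__ == "__main__":
    # Example tool names only; replace with whatever the analysis uses.
    for tool in ("samtools", "bwa"):
        print(describe_tool(tool))
```

Writing this dictionary next to each output file at least tells you later which of the two installed copies actually ran.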

Now, the only tool Bash (your "work-flow" interface) provides to alleviate this is .bash_history, but it is going to be a mess. Also, we have a habit of running everything on wildcards (do_stuff.sh *.sam), but you had to delete some of the files in the meantime due to space problems. So in the end, you have no clue what you effectively ran your stuff on either.
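For the wildcard problem specifically, one option is to snapshot what the pattern expanded to at the moment of the run. The sketch below is only an illustration: `snapshot_inputs` and the manifest layout are made up for this example, and hashing very large files will take a while.

```python
import glob
import hashlib
import json
import os
import time


def snapshot_inputs(pattern: str, manifest_path: str) -> list:
    """Write a JSON manifest of the files a wildcard matches right now."""
    entries = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks so large files do not fill memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        entries.append(
            {
                "path": os.path.abspath(path),
                "bytes": os.path.getsize(path),
                "sha256": digest.hexdigest(),
            }
        )
    record = {
        "pattern": pattern,
        "created": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "files": entries,
    }
    with open(manifest_path, "w") as out:
        json.dump(record, out, indent=2)
    return entries
```

Calling `snapshot_inputs("*.sam", "inputs.manifest.json")` just before `do_stuff.sh *.sam` leaves a record of the inputs even if some of them are deleted later.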

There are also these nasty tools that allow you or even require you to specify vital parameters interactively. You desperately wanted to avoid this, but fatigue set in and you didn't feel like sorting out how to specify these minor parameters on the command line, if that was even possible. Of course, there is no record of your interactions in .bash_history, and therefore you are now completely clueless about which parameters you ran your analysis with.

Finally, you notice that as a result of or despite your efforts, you do not know which program or what version of it you ran on what data using a set of parameters you can only vaguely remember. So you wish you had written the whole analysis in Snakemake from the get-go, but there was so much experimentation and tuning involved (it is scientific data after all).

So, if you want to solve something with AI here, you could implement a tool that "understands" what you are doing, optimally also understands the tools and their output, hooks into Bash, and turns your history into a correct Snakemake workflow. Also, when you want to do funny things like installing "the latest" version of a tool without specifying a version, it will warn you with a pop-up window to prevent such stupidity.
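The warning part does not even need AI. As a toy illustration (not an existing tool), a shell hook could pass each command to something like the function below. `unpinned_packages` is a hypothetical name, the option list is a small hand-picked subset, and the check is a crude heuristic: any package spec containing `=` counts as pinned, so `pkg>=1.0` slips through.

```python
import shlex

INSTALLERS = {"conda", "mamba", "pip"}
# Options whose following token is a value, not a package (partial list).
OPTIONS_WITH_VALUE = {
    "-c", "--channel", "-n", "--name", "-p", "--prefix", "-r", "--requirement",
}


def unpinned_packages(command: str) -> list:
    """Return packages in an install command that carry no version spec."""
    tokens = shlex.split(command)
    if len(tokens) < 3 or tokens[0] not in INSTALLERS or tokens[1] != "install":
        return []
    unpinned = []
    skip_next = False
    for token in tokens[2:]:
        if skip_next:
            skip_next = False
            continue
        if token.startswith("-"):
            skip_next = token in OPTIONS_WITH_VALUE
            continue
        if "=" not in token:  # e.g. "samtools" rather than "samtools=1.17"
            unpinned.append(token)
    return unpinned


for pkg in unpinned_packages("conda install -c bioconda samtools bwa=0.7.17"):
    print(f"warning: '{pkg}' is installed without a pinned version")
```

Turning the history into a workflow is the hard part; this only covers the "prevent such stupidity" pop-up.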

Remember me if you get rich and famous with the idea.

ADD COMMENT
0

Here is what ChatGPT responded when I gave it my text above:

Your description of the reproducibility nightmare in computing is all too familiar to anyone who has worked extensively in data analysis and scientific research. The issues you've highlighted, such as the lack of a standardized workflow, ad hoc tool usage, unclear provenance, and the challenges of tracking parameters, are indeed pervasive in computational research.

The idea of using AI to help address these challenges is intriguing. Here are some potential ways AI could be applied to improve reproducibility in computational research:

  1. Provenance Tracking: AI could assist in automatically tracking the provenance of data and analysis steps. It could parse log files, record tool versions, input data, and parameters used, and create a structured record of the analysis process.

  2. Workflow Automation: AI-driven systems could suggest or even automatically generate workflow scripts (e.g., in Snakemake or other workflow management systems) based on the commands executed by users. It could infer dependencies between steps and ensure that they are executed in a reproducible manner.

  3. Parameter Logging: AI could log interactive parameter selections made during tool usage and integrate this information into the workflow record. This would help researchers keep track of specific parameter choices.

  4. Environment Management: AI could monitor and manage the software environment, ensuring that tools are installed from controlled sources (like Conda) and tracking versions rigorously.

  5. Interactive Warnings: As you mentioned, AI could provide warnings or recommendations when users attempt actions that might compromise reproducibility, such as installing the latest version of a tool without specifying a version.

  6. Wildcards and File Management: AI could assist in managing files, especially when wildcards are used, by keeping track of what files were available at the time of analysis and warning users about changes.

  7. Natural Language Understanding: Advanced AI systems could interpret natural language queries from users about their past analyses and provide meaningful responses, making it easier to retrieve information about past experiments.

  8. Reproducibility Reports: AI could generate detailed reports summarizing an analysis, including the environment, data, tools, and parameters used, making it easier for others to replicate the work.

Implementing such a tool would require a deep understanding of the intricacies of computational research, including the diverse set of tools and practices involved. It would also need to be highly customizable to accommodate the unique needs of different research domains.

While creating such a tool is a complex endeavor, it's an exciting idea that could significantly improve the reproducibility of scientific research. If someone were to successfully develop and commercialize such a solution, it could have a profound impact on the field of computational research. Your idea highlights the potential for AI to address critical challenges in scientific computing, and it's certainly worth exploring further if you have the expertise and resources to pursue it.

ADD REPLY
1

Points 1-4, 6, and 8 are in principle already available with workflow management systems, so the advantage of AI would be in analyzing your workflow to provide warnings/recommendations (point 5) and the ability to query logs in natural language (point 7). What about having AI translate a workflow expressed in natural language into an executable one with all the bells and whistles listed above?

ADD REPLY
1

Also, Sasha's company has a product that makes small example workflows for basic tasks. This post was maybe directed at that. My proposal, however, is about generating the workflow directly from your Bash interactions. Whenever I attempt to write everything in a workflow language from the beginning, I drop out because of testing multiple tools and the steep learning curve of managing parameterization and the workflow language simultaneously.

For example, for our last paper, I sincerely wished to re-implement and reproduce a previously published analysis workflow, down to identical software versions, in Nextflow. Once we were almost through, we showed that this workflow was inappropriate as it produced biased results. So I had to start from scratch, this time just wrapping all the command calls in shell scripts.

ADD REPLY
