Question

Forum: What are the pain points of using Genomics on the cloud today?

1

2.2 years ago

datanerd ▴ 520

Hi guys, I have heard many reasons why researchers do not want to move to the cloud for their analysis. I definitely see cost as one of the challenges, and that is currently being worked out. With pipelines/workspaces and data available on the cloud, what could be improved to make the cloud easier to use between labs/institutes for genomic analysis?

Mamta

datasharing pipelines workspaces cloud genomics • 1.5k views

ADD COMMENT • link updated 2.2 years ago by Jeremy Leipzig 22k • written 2.2 years ago by datanerd ▴ 520

1

Cost is the least of the challenges IMO. Data privacy and data transfer times, as well as the level of control over how the data is processed, are bigger challenges. Personally, I think another pain point is translating a simple step (say, a bcftools norm run) to a step on Seven Bridges or a different platform, where I'm required to understand some sort of workflow language to edit existing workflows. Plus, there are very few platforms that allow as much customizability as SB. Some platforms are black boxes created for biologists, so bioinformaticians are stuck pre- and post-processing datasets instead of having a one-stop solution customized to the group's requirements.
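
For reference, here is a minimal sketch of what that single step amounts to when run locally. The helper name `bcftools_norm_cmd` and the file names are hypothetical, chosen only for illustration; the flags used (`-f`, `-m`, `-O`, `-o`) are regular bcftools norm options. On a hosted platform, this one command would typically also need a tool wrapper with declared inputs and outputs, which is where the workflow language comes in.

```python
import shlex


def bcftools_norm_cmd(vcf_in, ref_fasta, vcf_out):
    """Build the argument list for one bcftools norm run (hypothetical helper)."""
    return [
        "bcftools", "norm",
        "-f", ref_fasta,  # reference FASTA used for left-alignment
        "-m", "-any",     # split multiallelic records into biallelic ones
        "-O", "z",        # write compressed VCF
        "-o", vcf_out,    # output file
        vcf_in,           # input VCF comes last
    ]


# File names below are placeholders, not real data.
cmd = bcftools_norm_cmd("sample.vcf.gz", "ref.fa", "sample.norm.vcf.gz")
print(shlex.join(cmd))
```

The list could be handed to `subprocess.run(cmd, check=True)` on a machine where bcftools is installed; the sketch only builds and prints the command.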

ADD REPLY • link 2.2 years ago by Ram 43k

1

And then there is the problem of cloud interoperability, though NIH is working on finding solutions. The problem with these things is that no one solution fits all, nor can you please everyone.

ADD REPLY • link 2.2 years ago by GenoMax 141k

0

Agreed. I believe they are working to solve this issue.

ADD REPLY • link 2.2 years ago by datanerd ▴ 520

0

Great points you brought up.

Data privacy/control and transfer - I agree 100%
Data processing- with WDLs (simple readable language) are quite customizable I would think? Maybe there should be more interoperability between the different workflow languages. Is this what you are referring to, or something else?

When you say one-stop solution, you mean workflow that could start from absolute raw files, or automate the formatting/processing to some interpretable results? However, aren't the tertiary analysis quiet variable between researchers and more interactive? Do you think this is something that could be solved with notebooks that pick up where the workflow left off and are easier to modify?

ADD REPLY • link 2.2 years ago by datanerd ▴ 520

0

Data processing- with WDLs (simple readable language) are quite customizable I would think?

Yes, in theory. However, it is one more thing to learn for us, which is quite annoying when I've spent hours/days/weeks learning how to use the tool and all its idiosyncrasies.

you mean workflow that could start from absolute raw files, or automate the formatting/processing to some interpretable results

Yes, FASTQ -> TSV, ideally. You raise a valid point when you say "aren't the tertiary analysis quiet [sic] variable between researchers". I'd like to be able to adapt pieces of others' analyses that I find insightful and customize my pipeline, which is easy to do on a local machine/cluster but not so easy on the cloud. I don't blame the providers: they walk a really thin line between providing SaaS and IaaS, and the more advanced we get, the closer to IaaS we get. At that point, you'll need control over the AWS instances that Seven Bridges (for example) runs on, which SB can't give you because why would they?

ADD REPLY • link 2.2 years ago by Ram 43k

0

In The Biostar Handbook I have a chapter that I call "How not to waste your time" where I talk about cloud computing in general, CWL in particular, and how these solutions are little more than an illusion of getting ahead.

https://www.biostarhandbook.com/how-not-to-waste-your-time.html

I will cite a short subsection where I mention what I call the "False Peaks of Biology"

The False Peaks of Biology

In mountaineering, a false peak or false summit is a peak that appears to be the pinnacle of the mountain, but upon reaching, it turns out the summit is higher. False peaks can have significant effects on climbers’ psychological states by inducing feelings of dashed hopes or even failure.

There is a simple explanation of why the techniques and methods listed on this page even exist. Some are even supported by continually renewed, multi-million dollar governmental research grants.

You see, the vast majority of life scientists mistakenly believe that the primary limitation to understanding biology is their inability to run some existing software.

All these scientists believe that if they could get over this first hurdle, the software mountain that they see right in front of them, everything would start making sense and that they would be well on their way to making scientific discoveries.

ADD REPLY • link 2.2 years ago by Istvan Albert 100k

0

CWL actually has great ideas about tool descriptor metadata, and since those are essentially write-once/use-many, it is worth describing a tool's behavior and usage in detail. It would be great if other frameworks accepted CWL tool descriptors. I am not as keen on the workflow parts of CWL.
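
As a rough illustration of the write-once/use-many idea, the sketch below holds a trimmed, CWL-style tool description as a plain Python dict and renders a command line from it. The field names (`cwlVersion`, `class`, `baseCommand`, `inputs`, `inputBinding`, `prefix`, `position`, `outputs`) follow CWL's CommandLineTool vocabulary, but the `render_command` function is a hypothetical toy, not a CWL runner, and the bcftools example and file names are assumptions picked for illustration.

```python
# A trimmed, CWL-style description of one tool, held as a plain dict.
tool = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "baseCommand": ["bcftools", "norm"],
    "inputs": {
        "vcf": {"type": "File", "inputBinding": {"position": 2}},
        "reference": {
            "type": "File",
            "inputBinding": {"prefix": "-f", "position": 1},
        },
    },
    "outputs": {"normalized": {"type": "stdout"}},
}


def render_command(descriptor, values):
    """Toy renderer: order inputs by position, emit optional prefix then value."""
    bound = sorted(
        descriptor["inputs"].items(),
        key=lambda item: item[1]["inputBinding"]["position"],
    )
    argv = list(descriptor["baseCommand"])
    for name, spec in bound:
        prefix = spec["inputBinding"].get("prefix")
        if prefix:
            argv.append(prefix)
        argv.append(values[name])
    return argv


# Placeholder file names; any framework that reads the same description
# could build the same command from it.
argv = render_command(tool, {"reference": "ref.fa", "vcf": "sample.vcf.gz"})
print(argv)
```

The point of the sketch is only that the description is data: once written, anything able to read it can reuse it, which is the appeal of other frameworks accepting the same descriptors.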

ADD REPLY • link 2.2 years ago by Jeremy Leipzig 22k

0

I think what using the "cloud" means needs to be defined better; that way the pitfalls become clearer.

Does the "cloud" mean that we interact with it via a browser? Then, more often than not the browser is the limitation.

ADD REPLY • link 2.2 years ago by Istvan Albert 100k