16 months ago
datanerd ▴ 520
Hi guys, I have heard many reasons why researchers do not want to move to the cloud for their analysis. I definitely see cost as one of the challenges, and that is currently being worked out. With pipelines/workspaces and data available on the cloud, what could be improved to make the cloud more useful between labs/institutes for genomic analysis?
Cost is the least of the challenges IMO. Data privacy and data transfer times, as well as the level of control over how the data is processed, are bigger challenges. Personally, I think another pain point is translating a simple step (say, a bcftools norm run) to a step on Seven Bridges or a different platform, where I'm required to understand some sort of workflow language to edit existing workflows. Plus, there are very few platforms that allow as much customizability as SB. Some platforms are black boxes created for biologists, so bioinformaticians are stuck pre- and post-processing datasets instead of having a one-stop solution customized to the group's requirements.
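To make the translation cost concrete, here is a rough sketch of what wrapping a single bcftools norm call as a CWL tool description could look like, built as a Python dict and dumped to JSON (CWL documents may be written in JSON or YAML). The input names (reference, vcf), the output name (normalized) and the output file name are illustrative assumptions, not anything a particular platform requires, and a real descriptor would likely also declare a container image and more options.

```python
import json


def bcftools_norm_descriptor():
    """Return a minimal CWL v1.2 CommandLineTool description for `bcftools norm`.

    The input/output names and the stdout file name are hypothetical choices
    made for this sketch.
    """
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": ["bcftools", "norm"],
        "inputs": {
            # Reference FASTA, passed as `-f <file>`
            "reference": {
                "type": "File",
                "inputBinding": {"prefix": "-f", "position": 1},
            },
            # Input VCF/BCF, passed as a positional argument
            "vcf": {
                "type": "File",
                "inputBinding": {"position": 2},
            },
        },
        # bcftools norm writes to standard output by default; capture it in a file
        "stdout": "normalized.vcf",
        "outputs": {
            "normalized": {"type": "stdout"},
        },
    }


if __name__ == "__main__":
    # Print the descriptor as JSON so it can be saved to a file
    print(json.dumps(bcftools_norm_descriptor(), indent=2))
```

On a local machine the same step is a one-liner along the lines of bcftools norm -f ref.fa in.vcf > normalized.vcf, which is the gap being described here: the wrapper adds typed inputs and outputs, but it is more to write and more to learn.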
And then there is the problem of cloud interoperability, though NIH is working on finding solutions. The problem with these things is that no one solution fits all, nor can you please everyone.
Agreed. I believe they are working to solve this issue.
Great points you brought up.
When you say one-stop solution, do you mean a workflow that could start from absolute raw files, or one that automates the formatting/processing down to some interpretable results? However, aren't the tertiary analysis quiet variable between researchers and more interactive? Do you think this is something that could be solved with notebooks that pick up from where the workflow left off and are more modifiable?
Yes, in theory. However, it is one more thing to learn for us, which is quite annoying when I've spent hours/days/weeks learning how to use the tool and all its idiosyncrasies.
Yes, FASTQ -> TSV, ideally. You raise a valid point when you say "aren't the tertiary analysis quiet [sic] variable between researchers". I'd like to be able to adapt pieces of others' analyses that I find insightful and customize my pipeline, which is easy to do on a local machine/cluster but not so easy on the cloud. I don't blame the providers: they walk a really thin line between providing SaaS and IaaS, and the more advanced we get, the closer to IaaS we get. At that point, you'll need control over the AWS instances that Seven Bridges (for example) runs on, which SB can't give you because why would they?
In The Biostar Handbook I have a chapter that I call "How not to waste your time" where I talk about cloud computing in general, CWL in particular, and how these solutions are little more than an illusion of getting ahead.
I will cite a short subsection where I mention what I call the "False Peaks of Biology"
CWL actually has great ideas about tool descriptor metadata, and since those are essentially write-once/use-many, it is worth describing a tool's behavior and usage in detail. It would be great if other frameworks accepted CWL tool descriptors. I am not as keen on the workflow parts of CWL.
I think using the "cloud" needs to be defined better; that way the pitfalls become clearer.
Does the "cloud" mean that we interact with it via a browser? Then, more often than not, the browser is the limitation.