Question

Forum: Large Multi-omics Project -- Dev Approach

1

4.7 years ago

CK ▴ 10

I'm working in a university research group that's drafting plans for a large multi-omics resource that will act as both a data repository and an online tool for integrative analyses.

As a bioinformatician, I feel the task involves aspects I'm not strictly an expert in, especially for a project of this scale: 7- to 8-figure budget, 200 TB of raw data, decisions on backend databases, frontend tech stacks, etc. We've started reaching out to cloud service providers (e.g. AWS, Google) and software engineers within the university.

It's still early days but I'm wondering if anyone has experience in making a success of a project of this size? The plan is to move from conception through to deployment in a 2-3 year timeframe.

My initial sentiment is to bring in commercial contractors for an initial consultation (say 6 months), then to bring in backend and frontend devs (again, probably on a contractual basis) to work with bioinformaticians (postdocs mostly).

Of course there are questions about how many individuals need to be hired, which skills need to be prioritized and at which stage (e.g. database management / backend / frontend, etc.), and which technologies to use. I'd be keen to push for a NoSQL / Django / Flask backend with microservice APIs (it's relatively easy to train up our bioinformaticians to create APIs / modules to extend functionality), but that's just me.

Would be great to hear any tips / stories from others who have made a success of a large project like this!

Thanks

snp RNA-Seq sequencing • 1.1k views

updated 10 months ago by Ram 43k • written 4.7 years ago by CK ▴ 10

score 2 · Answer 1 · 2019-08-22

Just for the credentials: I've been involved in projects producing on the order of dozens of TB, like the Mitocheck project some 10+ years ago, and I am regularly involved in projects dealing with multi-TB data sets (mostly microscopy images). From my point of view, once you reach the tens of TB, there's not much difference between 20 TB and 200 TB. What actually matters is the granularity of the data (is it 10 million small files or 10,000 large ones?) and the relationships that you need to keep track of.

What you need is a serious data management plan. Without knowing the details of the project, it's hard to be specific, but some things are generic enough to be mentioned. One of the first things to do is to identify the stakeholders and who is responsible for which part of data management (e.g. who will create and use the data? who will deal with which IT aspect?). Then you need to identify the data flow and access patterns: which data is going to be accessed (e.g. the raw data or some derived form of it? what kind of metadata is needed?), how, and for what purpose? Are there access control requirements? Will the data change and, if so, how will this be tracked/propagated? What will happen to the data after the end of the project (funders increasingly care about this)? It is also important to define conventions/standards and document them (e.g. which file formats are going to be used, structure and naming conventions for the project directories and APIs).

There's more, and all of this may influence decisions about the technologies you may want to use (cloud or no cloud, SQL or NoSQL...). I would think bringing in commercial contractors for consultation is a waste of money unless they have demonstrable experience with the type of project you want to run. You can come up with a solution by sufficient brainstorming and asking the right people, just like you've started doing with this post.
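To make the granularity question concrete, here is a minimal sketch (not anything from the original project) that walks a directory tree and reports how many files there are, their total size, and a rough size histogram. The function names `survey` and `bucket_for` and the bucket boundaries in `BUCKETS` are assumptions chosen for illustration; adjust them to whatever distinction matters for your storage backend.

```python
import os
from collections import Counter

# Hypothetical size buckets: (exclusive upper bound in bytes, label).
# These boundaries are an assumption for illustration only.
BUCKETS = [
    (1 << 20, "<1 MiB"),
    (1 << 30, "1 MiB-1 GiB"),
    (float("inf"), ">=1 GiB"),
]


def bucket_for(size):
    """Return the label of the first bucket whose upper bound exceeds size."""
    for upper, label in BUCKETS:
        if size < upper:
            return label


def survey(root):
    """Walk root and return (file_count, total_bytes, Counter of bucket labels)."""
    count = 0
    total = 0
    hist = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Skip files that vanish or cannot be read (e.g. broken symlinks).
                continue
            count += 1
            total += size
            hist[bucket_for(size)] += 1
    return count, total, hist
```

Calling `survey()` on a pilot data set gives you the file count, the total size, and the per-bucket counts. Millions of files in the smallest bucket point to a different storage and metadata design than a few thousand multi-GiB files, even when the total volume is the same.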