Question: Self Taught, Where To Start With Bioinformatics Data Management?
7
3.5 years ago by
Blunders940 wrote:

BACKGROUND: It's possible that I will be hired to do basic data management for a bioinformatics company in the next week. I've got a fair amount of experience in data management, but have never worked in the bioinformatics industry.

GOAL: I'm looking to start rounding out my understanding of major topics related to data management within bioinformatics.

Examples: Tools, Industry Standards, Data Quality Methods, Public Data Sources, Metadata Standards, Existing Open Source Code related to data quality/management, annotation systems, BioHDF/HDF5, etc.

Comments/Feedback: Not sure if the question makes sense, or if it's even a good question. But it's my first here on BioStar, so please go easy on me and feel free to comment if I can provide additional information -- or update/delete the question... :-) ...one thing I might add is cheap/free is better, but that does not mean it's always the best option in terms of total cost of ownership. Again, thanks -- and feel free to post any and all information related to bioinformatics data management!

UPDATES:

  • If anyone wants to edit the tags and add "data management" please feel free to do so.
  • Related BioStar searches: data-management
  • My long-term focus is to grow the body of resources on BioStar related to data management, since there do not appear to be many questions/answers on the topic.
modified 3.3 years ago by Neilfws41k • written 3.5 years ago by Blunders940
13
3.5 years ago by
Sydney, Australia
Neilfws41k wrote:

If you have experience in data management, you're ahead of many people in life sciences already. The article A Quick Guide to Organizing Computational Biology Projects is a collection of such obvious, common-sense tips as "have a sensible file/directory hierarchy". I'm not sure which is more depressing: that this was deemed worthy of an academic article, or that it needed to be written in the first place.

In terms of biological data, specifically, I think these points are worth bearing in mind:

  • There's a lot of it

An obvious point, but there is a huge amount of biological data available. 10 years ago, it was probably feasible to download most of it via FTP to local storage. Today, we have much more storage but not the bandwidth - you'd be waiting for months. So we rely more on remote data stores. Which raises the question: how do you move the computational analysis to the data? That's why people are talking up "the cloud".

  • It's growing exponentially

Search this site for "growth of biological databases", or similar. 10 years ago there was one human genome sequence (based on several individuals) - 3 Gb. In 10 years' time there may be thousands. And that's just one of what, 10 million species?

  • It's extremely diverse

Biological data come from all fields of biology and in many formats. The largest amount of data is that generated by high-throughput methods: sequencing (nucleic acid and protein) and microarray technology. Structural (X-ray crystallography, NMR) and metabolic/pathway data are also major and growing components. Not to mention the literature databases and data from other areas of biology, such as ecology - geospatial/mapping, population studies.

Every field has its own formats (frequently reinvented many times) and tools. You'll need to gain a broad overview of what's relevant for you.

  • Versioning is an issue

Primary data are always being updated. So are the results of our analyses, as we develop or discover new computational tools. Data versioning is not feasible. What is feasible: maintain read-only "master copies" of primary data, version your code, then you can say "this input + this code version will reproducibly generate this output."
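
The "this input + this code version gives this output" idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not a method from the answer: the function names, the manifest layout and the use of SHA-256 checksums are all choices made for the example, and `code_version` is assumed to be something like a git commit hash that you supply yourself.

```python
import hashlib
import json
import os
import stat


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_read_only(path):
    """Clear all write bits so a master copy is not edited by accident."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def provenance_record(input_path, code_version, output_path):
    """Tie an output to the exact input bytes and code version that made it."""
    return {
        "input": os.path.basename(input_path),
        "input_sha256": sha256_of_file(input_path),
        "code_version": code_version,  # assumed: a commit hash or release tag
        "output": os.path.basename(output_path),
        "output_sha256": sha256_of_file(output_path),
    }


def write_manifest(record, manifest_path):
    """Store the record as JSON next to the results (hypothetical layout)."""
    with open(manifest_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
```

If the primary data are later updated upstream, the checksum in the manifest no longer matches the new download, which tells you the old output belongs to the old input.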

  • There are data standards but they are frequently ignored or abused

Biologists don't care about data standards: say "XML", you will get a blank stare. However, many attempts have been made: some are more successful than others but you'll find that they are simply not enforced. Classic example: the NCBI GEO microarray database, which purports to use standards but relaxes them (otherwise nobody would submit records), to the degree that keys/values are practically optional and arbitrary. This makes large-scale analysis across the entire dataset challenging, to say the least.
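
To make the "keys/values are practically optional and arbitrary" problem concrete, here is a rough sketch of the kind of check you might run on submitted metadata before trusting it. Everything in it is an assumption for illustration: the required keys, the controlled vocabulary and the function names are hypothetical and are not taken from GEO or any real standard.

```python
# Hypothetical rules; a real project would take these from its chosen standard.
REQUIRED_KEYS = {"organism", "tissue", "platform"}
CONTROLLED_VOCAB = {"organism": {"Homo sapiens", "Mus musculus"}}


def normalize_key(key):
    """Fold case and whitespace so 'Organism ' and 'organism' match."""
    return key.strip().lower().replace(" ", "_")


def validate_record(record):
    """Return (normalized record, list of problems) for one metadata dict."""
    normalized = {normalize_key(k): v.strip() for k, v in record.items()}
    problems = []
    # Report required keys that are absent after normalization.
    for key in sorted(REQUIRED_KEYS - set(normalized)):
        problems.append(f"missing required key: {key}")
    # Report values that fall outside the controlled vocabulary.
    for key, allowed in CONTROLLED_VOCAB.items():
        value = normalized.get(key)
        if value is not None and value not in allowed:
            problems.append(f"unexpected value for {key}: {value!r}")
    return normalized, problems
```

A record like `{"Organism ": "human", "Tissue": "liver"}` comes back with two problems: a missing `platform` key and an organism value that is not in the vocabulary.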

The (possibly-indexed) flat file is still king in bioinformatics.
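
For anyone new to the idea, an "indexed flat file" just means a plain text file plus a lookup table of byte offsets, so one record can be read without scanning the whole file. Below is a minimal sketch for a FASTA file, using only the Python standard library. The function names are made up for this example; in practice, established tools such as `samtools faidx` build this kind of index for you.

```python
def build_fasta_index(path):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as handle:
        offset = handle.tell()
        line = handle.readline()
        while line:
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    # The ID is the first whitespace-delimited token.
                    index[fields[0].decode("ascii")] = offset
            offset = handle.tell()
            line = handle.readline()
    return index


def fetch_sequence(path, index, record_id):
    """Seek straight to one record and return its sequence as a string."""
    parts = []
    with open(path, "rb") as handle:
        handle.seek(index[record_id])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            parts.append(line.strip())
    return b"".join(parts).decode("ascii")
```

The index is a small dict that can be saved and reused, while the sequence data stay in the original flat file.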

  • There are many public resources but only a few key resources

Every year, the journal Nucleic Acids Research publishes special issues that describe hundreds of online databases and web applications. As we noted in another question, these resources are frequently not persistent. For someone coming into the field, I'd recommend that you focus on the major public resources: NCBI, EBI, Ensembl, PDB, KEGG. If your field is more specific, identify the major resources in that area (e.g. IMG for microbial genomics).

Don't expect to find great APIs for public data. They're improving, but have not even been considered until quite recently. Remember, many of these resources have been around 20 years or more; there's a lot of legacy technology behind them.
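
As one example of what programmatic access can look like, NCBI exposes its databases through the E-utilities web service. The sketch below only builds an `efetch` URL and wraps a plain HTTP GET; it is an illustration, not an official client. The helper names are hypothetical, the accession in the usage note is just an example, and you should check NCBI's current usage guidelines (rate limits, API keys) before running anything in bulk.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """Build an E-utilities efetch URL for one or more record IDs."""
    query = urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


def fetch_text(url, timeout=30):
    """Download a URL and return the body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Usage would be along the lines of `fetch_text(efetch_url("nuccore", ["NM_000546"]))`, which asks for one nucleotide record in FASTA format.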

In terms of technologies that can help you: (1) Databases, of course. There's growing interest in so-called "NoSQL" solutions, but it's worth looking at older projects such as BioSQL. (2) Open-source bioinformatics libraries, particularly the so-called 'Bio*' projects: Bioperl, BioRuby, Biopython, BioJava. (3) Lots of basic Linux/UNIX command-line skills.
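
To give a feel for point (1), here is a very small relational sketch using SQLite from the Python standard library. To be clear, this is not the BioSQL schema, which is far richer; the table, column and function names are invented for the example. It only shows the basic pattern of keeping every version of a sequence record under its accession.

```python
import sqlite3

# Hypothetical one-table schema: one row per (accession, version).
SCHEMA = """
CREATE TABLE IF NOT EXISTS sequence (
    accession TEXT NOT NULL,
    version   INTEGER NOT NULL,
    organism  TEXT NOT NULL,
    residues  TEXT NOT NULL,
    PRIMARY KEY (accession, version)
);
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_sequence(conn, accession, version, organism, residues):
    """Insert one record; a repeated (accession, version) raises an error."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sequence VALUES (?, ?, ?, ?)",
            (accession, version, organism, residues),
        )


def latest_version(conn, accession):
    """Return (version, residues) for the newest version, or None."""
    return conn.execute(
        "SELECT version, residues FROM sequence "
        "WHERE accession = ? ORDER BY version DESC LIMIT 1",
        (accession,),
    ).fetchone()
```

Older versions stay in the table, so an analysis that used version 1 can still be traced back to the exact residues it saw.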

modified 2.9 years ago • written 3.5 years ago by Neilfws41k
1

@neilfws: +1 and selected as answer. Huge help, thank you!

written 3.5 years ago by Blunders940
1

You're welcome. Go on then, select as answer :-)

written 3.5 years ago by Neilfws41k

@neilfws: +1 oops! Thought I did. You're selected as the answer now, and I gave your comment above a +1... Again, thanks!

written 3.5 years ago by Blunders940
2
3.5 years ago by
Andrea_Bio2.1k wrote:

This paper lists all of the bioinformatics buzzwords that you might want to know, although I don't remember much mention of ontologies, which are a big thing in interoperability:

http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1000589

written 3.5 years ago by Andrea_Bio2.1k
2

Ontologies are beloved of ontology developers, but not widely deployed by anyone else (e.g. biologists), in my experience.

written 3.5 years ago by Neilfws41k
1

I think people talk about ontologies a lot, and you have to appreciate that they are important, otherwise you offend the people who painstakingly spend years building them. But neilfws may well be right that they aren't widely used in practice.

written 3.5 years ago by Andrea_Bio2.1k
1

I just thought that you might also want to be aware of the notion of community curation. That is becoming more popular but naturally has a huge impact on data quality.

written 3.5 years ago by Andrea_Bio2.1k

@andrea_bio: +1 Thanks, related statement: "Data quality, in turn, is a function of consistent analysis methodology, standard ontology, vocabularies, and dictionaries, and vetting/approval of annotations, not to mention the all-important pruning of bad content." Without knowing the bioinformatics workflow, the statement appears to be on point, and related to your point on ontological consistency being vital to interoperability.

written 3.5 years ago by Blunders940

@andrea_bio: Yes, I agree - but also based on my experience the overhead for end users in the short-term for developing ontologies and using them is often too much; which is not to say they're not important.

written 3.5 years ago by Blunders940
1
3.5 years ago by
Issaquah, WA
Mndoci1.1k wrote:

One piece of advice. As you think about data architectures, etc, make sure you collaborate with a good bioinformatician. Making sure your data structures make biological sense is critical.

written 3.5 years ago by Mndoci1.1k

@mndoci: +1 Thanks for posting, and yes, I'm just the "tech wiz" -- all implementations will be driven by end user requirements/needs... :-)

written 3.5 years ago by Blunders940