BACKGROUND: It's possible that I will be hired to do basic data management for a bioinformatics company in the next week. I've got a fair amount of experience in data management, but have never worked in the bioinformatics industry.
GOAL: I'm looking to start rounding out my understanding of major topics related to data management within bioinformatics.
Examples: tools, industry standards, data quality methods, public data sources, metadata standards, existing open-source code related to data quality/management, annotation systems, BioHDF/HDF5, etc.
Comments/Feedback: Not sure if the question makes sense, or if it's even a good question. But it's my first here on BioStar, so please go easy on me and feel free to comment if I can provide additional information -- or update/delete the question... :-) ...one thing I might add is that cheap/free is better, but that does not mean it's always the best option in terms of total cost of ownership. Again, thanks -- and feel free to post any and all information related to bioinformatics data management!
If you have experience in data management, you're ahead of many people in life sciences already. The article A Quick Guide to Organizing Computational Biology Projects is a collection of such obvious, common-sense tips as "have a sensible file/directory hierarchy". I'm not sure which is more depressing: that this was deemed worthy of an academic article, or that it needed to be written in the first place.
In terms of biological data, specifically, I think these points are worth bearing in mind:
An obvious point, but there is a huge amount of biological data available. 10 years ago, it was probably feasible to download most of it via FTP to local storage. Today, we have much more storage but not the bandwidth - you'd be waiting for months. So we rely more on remote data stores. Which raises the question: how do you move the computational analysis to the data? That's why people are talking up "the cloud".
Search this site for "growth of biological databases", or similar. 10 years ago there was one human genome sequence (based on several individuals) - 3 Gb. In 10 years' time there may be thousands. And that's just one of what, 10 million species?
Biological data come from all fields of biology and in many formats. The largest amount of data is that generated by high-throughput methods: sequencing (nucleic acid and protein) and microarray technology. Structural (X-ray crystallography, NMR) and metabolic/pathway data are also major and growing components. Not to mention the literature databases and data from other areas of biology, such as ecology - geospatial/mapping, population studies.
Every field has its own formats (frequently reinvented many times) and tools. You'll need to gain a broad overview of what's relevant for you.
Primary data are always being updated. So are the results of our analyses, as we develop or discover new computational tools. Data versioning is not feasible. What is feasible: maintain read-only "master copies" of primary data, version your code, then you can say "this input + this code version will reproducibly generate this output."
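A minimal sketch of the "this input + this code version gives this output" idea, assuming checksums are an acceptable way to identify files. The function names and the record layout are hypothetical, made up for illustration; they are not from any particular tool.

```python
import hashlib
import os
import stat


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_read_only(path):
    """Clear the write bits so a master copy is not edited by accident."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def provenance_record(input_path, code_version, output_path):
    """Tie one input file and one code version to one output file.

    code_version is whatever identifies the code (for example a version
    control revision string); this sketch just stores it as given.
    """
    return {
        "input": os.path.basename(input_path),
        "input_sha256": sha256_of(input_path),
        "code_version": code_version,
        "output": os.path.basename(output_path),
        "output_sha256": sha256_of(output_path),
    }
```

Storing one such record next to each result file (as JSON, say) is enough to check later whether a result still matches its input and code version.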
Biologists don't care about data standards: say "XML", you will get a blank stare. However, many attempts have been made: some are more successful than others but you'll find that they are simply not enforced. Classic example: the NCBI GEO microarray database, which purports to use standards but relaxes them (otherwise nobody would submit records), to the degree that keys/values are practically optional and arbitrary. This makes large-scale analysis across the entire dataset challenging, to say the least.
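To give a feel for the clean-up this forces on you, here is a rough sketch of harmonizing free-form metadata keys before any cross-dataset analysis. Every key and canonical name in the table is invented for illustration and is not taken from GEO or any other real schema; in practice the table would be built by inspecting the records you actually have.

```python
import re

# Hypothetical synonym table: all keys and canonical names are invented
# for illustration, not taken from any real database schema.
CANONICAL_KEYS = {
    "tissue": "tissue",
    "tissue type": "tissue",
    "organ": "tissue",
    "sex": "sex",
    "gender": "sex",
    "age": "age",
    "age years": "age",
}


def normalize_key(raw_key):
    """Lower-case a key and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", raw_key.lower()).strip()


def harmonize(record):
    """Split a record into (recognized fields, unrecognized fields)."""
    known, unknown = {}, {}
    for raw_key, value in record.items():
        canonical = CANONICAL_KEYS.get(normalize_key(raw_key))
        if canonical is None:
            # Keep unrecognized keys so they can be reviewed by hand.
            unknown[raw_key] = value
        else:
            known[canonical] = value.strip()
    return known, unknown
```

Keeping the unrecognized fields rather than dropping them is deliberate: with optional and arbitrary keys, the "unknown" pile is usually where the next round of synonyms comes from.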
The (possibly-indexed) flat file is still king in bioinformatics.
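As a rough illustration of what "indexed flat file" means, the sketch below records the byte offset of each record in a FASTA file and then seeks straight to one record without reading the rest. It assumes a plain, uncompressed FASTA file with unique IDs; the function names are hypothetical and the in-memory index is not the on-disk format of any real tool (established tools such as samtools faidx do the same job properly).

```python
def build_fasta_index(path):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The ID is the first whitespace-delimited token after '>'.
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_sequence(path, index, record_id):
    """Seek straight to one record and return its sequence as a string."""
    parts = []
    with open(path, "rb") as handle:
        handle.seek(index[record_id])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            parts.append(line.strip())
    return b"".join(parts).decode("ascii")
```

The index is small compared with the data, so it can be built once and saved; after that, fetching one record costs a seek rather than a full scan of the file.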
Every year, the journal Nucleic Acids Research publishes special issues that describe hundreds of online databases and web applications. As we noted in another question, these resources are frequently not persistent. For someone coming into the field, I'd recommend that you focus on the major public resources: NCBI, EBI, Ensembl, PDB, KEGG. If your field is more specific, identify the major resources in that area (e.g. IMG for microbial genomics).
Don't expect to find great APIs for public data. They're improving, but were not even considered until quite recently. Remember, many of these resources have been around for 20 years or more, so there's a lot of legacy technology behind them.
In terms of technologies that can help you: (1) Databases, of course. There's growing interest in so-called "NoSQL" solutions, but it's worth looking at older projects such as BioSQL. (2) Open-source bioinformatics libraries, particularly the so-called "Bio*" projects: BioPerl, BioRuby, Biopython, BioJava. (3) Lots of basic Linux/UNIX command-line skills.
This paper lists all of the bioinformatics buzzwords that you might want to know, although I don't remember much mention of ontologies, which are a big thing in interoperability.