If you have experience in data management, you're already ahead of many people in the life sciences. The article "A Quick Guide to Organizing Computational Biology Projects" is a collection of such obvious, common-sense tips as "have a sensible file/directory hierarchy". I'm not sure which is more depressing: that this was deemed worthy of an academic article, or that it needed to be written in the first place.
In terms of biological data, specifically, I think these points are worth bearing in mind:
- There is a lot of it
An obvious point, but there is a huge amount of biological data available. Ten years ago it was probably feasible to download most of it via FTP to local storage. Today we have much more storage but not the bandwidth: you'd be waiting months for the transfer to finish. So we rely more on remote data stores, which raises the question: how do you move the computational analysis to the data? That's why people are talking up "the cloud".
- It's growing exponentially
Search this site for "growth of biological databases", or similar. Ten years ago there was one human genome sequence (based on several individuals), about 3 Gb. In ten years' time there may be thousands. And that's just one of, what, 10 million species?
- It is heterogeneous
Biological data come from all fields of biology and in many formats. Most of the data are generated by high-throughput methods: sequencing (nucleic acid and protein) and microarray technology. Structural data (X-ray crystallography, NMR) and metabolic/pathway data are also major and growing components. Not to mention the literature databases and data from other areas of biology, such as ecology: geospatial/mapping data and population studies.
Every field has its own formats (frequently reinvented many times) and tools. You'll need to gain a broad overview of what's relevant for you.
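As a small illustration of how formats differ even at first glance, here is a sketch (stdlib Python only; the function name and heuristics are my own, and real files are far messier than this) that guesses a sequence format from how each file conventionally begins:

```python
def sniff_format(text):
    """Guess a sequence format from the first non-blank line.

    A toy heuristic: real parsers (e.g. in the Bio* libraries)
    handle many more formats and edge cases.
    """
    first = next((line for line in text.splitlines() if line.strip()), "")
    if first.startswith(">"):
        return "fasta"      # FASTA records start with a '>' header line
    if first.startswith("@"):
        return "fastq"      # FASTQ records start with an '@' header line
    if first.startswith("LOCUS"):
        return "genbank"    # GenBank flat files open with a LOCUS line
    return "unknown"

print(sniff_format(">seq1 some description\nACGT\n"))
```

Even a crude dispatcher like this is often how pipelines decide which real parser to hand a file to.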
- It changes
Primary data are always being updated, and so are the results of our analyses as we develop or discover new computational tools. Versioning the data themselves is not feasible. What is feasible: maintain read-only "master copies" of primary data and version your code; then you can say "this input plus this code version will reproducibly generate this output."
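One minimal sketch of that idea, assuming you can hash your inputs and know your code version (the function names here are illustrative, not from any particular tool):

```python
import hashlib

def sha256_bytes(data: bytes) -> str:
    """Content hash: identifies the exact bytes of an input or output."""
    return hashlib.sha256(data).hexdigest()

def run_analysis(raw: bytes) -> bytes:
    # Stand-in for a real pipeline step; here it just uppercases the input.
    return raw.upper()

def analyse_with_provenance(raw: bytes, code_version: str) -> dict:
    """Run an analysis and record enough to reproduce it later:
    the input hash, the code version, and the output hash."""
    output = run_analysis(raw)
    return {
        "input_sha256": sha256_bytes(raw),
        "code_version": code_version,   # e.g. a git commit hash
        "output_sha256": sha256_bytes(output),
    }

record = analyse_with_provenance(b"acgtacgt", code_version="deadbeef")
```

If the stored record matches a fresh run, you know the master copy is intact and the result is reproducible; if not, either the data or the code changed.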
- There are data standards but they are frequently ignored or abused
Biologists don't care about data standards: say "XML" and you will get a blank stare. Many attempts at standards have been made; some are more successful than others, but you'll find that they are simply not enforced. A classic example is the NCBI GEO microarray database, which purports to use standards but relaxes them (otherwise nobody would submit records) to the degree that keys and values are practically optional and arbitrary. This makes large-scale analysis across the entire dataset challenging, to say the least.
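To see why this hurts, here is a toy parser for "key: value" characteristics lines, modeled loosely on fields in GEO's SOFT format (the sample lines below are invented, but the pattern of missing or inconsistent keys is real):

```python
def parse_characteristics(lines):
    """Split 'key: value' characteristics into a dict, tolerating
    lines that omit the key entirely (a common annoyance in
    submitter-supplied metadata)."""
    tagged = {}
    untagged = []
    for line in lines:
        if ":" in line:
            key, _, value = line.partition(":")
            tagged[key.strip().lower()] = value.strip()
        else:
            untagged.append(line.strip())  # a value with no key at all
    return tagged, untagged

# Two submitters describing the same sample quite differently.
sample_a = ["tissue: liver", "age: 54"]
sample_b = ["Tissue type: Liver", "54 years old"]
```

Reconciling "tissue" with "Tissue type", and recovering the age from a bare "54 years old", is exactly the kind of cleanup that dominates large-scale reuse of these records.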
The (possibly-indexed) flat file is still king in bioinformatics.
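A sketch of why that works in practice: a few lines of stdlib Python are enough to build a byte-offset index over a FASTA file and fetch records on demand, without loading the whole file (a toy version of what Biopython's SeqIO.index does):

```python
import io

def index_fasta(handle):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    offset = handle.tell()
    for line in iter(handle.readline, ""):
        if line.startswith(">"):
            index[line[1:].split()[0]] = offset
        offset = handle.tell()
    return index

def fetch(handle, index, record_id):
    """Seek straight to a record and read until the next header (or EOF)."""
    handle.seek(index[record_id])
    lines = [handle.readline()]
    for line in iter(handle.readline, ""):
        if line.startswith(">"):
            break
        lines.append(line)
    return "".join(lines)

# In-memory stand-in for a multi-gigabyte flat file on disk.
fasta = io.StringIO(">seq1 first\nACGT\n>seq2 second\nTTTT\n")
idx = index_fasta(fasta)
```

The index stays small even when the flat file doesn't, which is much of why the format has survived.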
- There are many public resources but only a few key resources
Every year, the journal Nucleic Acids Research publishes special issues that describe hundreds of online databases and web applications. As we noted in another question, these resources are frequently not persistent. For someone coming into the field, I'd recommend that you focus on the major public resources: NCBI, EBI, Ensembl, PDB, KEGG. If your field is more specific, identify the major resources in that area (e.g. IMG for microbial genomics).
Don't expect to find great APIs for public data. They're improving, but good APIs were barely a consideration until quite recently. Remember, many of these resources have been around for 20 years or more; there's a lot of legacy technology behind them.
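That said, the NCBI E-utilities are one API that has been around for a while. A sketch of building (not sending) an efetch request URL with only the standard library; db, id, rettype and retmode are real efetch parameters, but the helper name is mine:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """Build an NCBI E-utilities efetch URL. No request is made here;
    in real use you would also respect NCBI's rate limits."""
    params = urlencode({
        "db": db,
        "id": ",".join(ids),
        "rettype": rettype,
        "retmode": retmode,
    })
    return f"{EUTILS}/efetch.fcgi?{params}"

url = efetch_url("nucleotide", ["NM_000546"])
```

For anything beyond one-off fetches, the Bio* libraries wrap these endpoints so you don't hand-roll URLs at all.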
In terms of technologies that can help you: (1) databases, of course; there's growing interest in so-called "NoSQL" solutions, but it's worth looking at older projects such as BioSQL. (2) Open-source bioinformatics libraries, particularly the so-called 'Bio*' projects: Bioperl, BioRuby, Biopython, BioJava. (3) Lots of basic Linux/UNIX command-line skills.