My job profile closely matches part of what you describe. I work on integrating lab reports, clinical data, and patient genetics into a disease-specific knowledge base.
I think this approach is a great idea for diseases where the underlying gene is known and the causative mutations are, hopefully, finite in number. The challenge lies in mining the clinical reports: each report may present similar symptoms and diagnoses/conclusions in different ways, owing to the biases of the record creator and the patient's own perception. Another challenge is normalizing biochemical assays across locations that may use different protocols.
There are many other factors that make the data difficult to work with, especially for automated analyses. Even with future-proofed historical data, I end up spending a considerable amount of time cleaning it and getting it into shape.
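To make the assay-normalization point concrete, here is a minimal sketch of one common approach: expressing each measurement as a z-score within its own site, so that values collected under different protocols become roughly comparable. The column names (`site`, `assay_value`) and the toy numbers are my own assumptions, not from any real dataset, and real harmonization usually needs more than a per-site z-score.

```python
# Hypothetical records: the field names and values are illustrative only.
from collections import defaultdict
from statistics import mean, stdev

records = [
    {"site": "A", "assay_value": 5.1},
    {"site": "A", "assay_value": 4.9},
    {"site": "A", "assay_value": 5.3},
    {"site": "B", "assay_value": 51.0},  # site B's protocol reports ~10x units
    {"site": "B", "assay_value": 49.0},
    {"site": "B", "assay_value": 53.0},
]

# Group raw values by site.
by_site = defaultdict(list)
for r in records:
    by_site[r["site"]].append(r["assay_value"])

# Per-site mean and standard deviation.
stats = {s: (mean(v), stdev(v)) for s, v in by_site.items()}

# Attach a site-normalized z-score to each record.
for r in records:
    mu, sd = stats[r["site"]]
    r["z"] = (r["assay_value"] - mu) / sd

print([round(r["z"], 2) for r in records])
# → [0.0, -1.0, 1.0, 0.0, -1.0, 1.0]
```

Note how the two sites' very different raw scales collapse onto the same z-scale; in practice you would also want to check that the assays actually measure the same analyte before pooling them.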
And of course, there is the most important issue: patient confidentiality. Even with anonymization, one might end up releasing data that violates HIPAA in some obscure way, and that is a risk not many institutions are willing to take.
OP, how do you plan on addressing these concerns? IMO, deciding which algorithms to use before the problem statement has been unambiguously defined and the data examined in depth is not ideal; I say this from experience. We are better off looking at the data and the live nature of the knowledge base, then working out an optimized process to get to the end product.
What's the question? If you want to discuss this topic, I think it would be best posted in the forum section.
Sorry, I should have posted this in the forum section. I didn't notice that.
Moved to Forum.