An alternative way to phrase this question: How do you increase the quality (specificity & sensitivity) of pathway analysis results?
Our group often uses a pathway analysis approach to try to understand our data in more global terms. This has produced some interesting results in the past, so we keep at it. We may start with one of these two kinds of dataset:
1) gene expression data from an organ of interest (in our case, the brain or a microdissected part of it) at different developmental timepoints -- possibly with one key transcription factor knocked out, compared to controls
2) gene content from de novo copy number variants from patients with a phenotype of interest
Both datasets result in a list of genes that are either (case 1) differentially expressed or (case 2) present in altered copy number in patients. A typical next step is to use pathway analysis to help identify other genes that may be involved in a given phenotype. I like to think this gives us a "holistic" approach, but sometimes I'm not so sure. Like many groups, I'll bet, we have paid for an Ingenuity license, but that often gives some humorous results, such as the time a colleague of mine kept getting back a pathway where multiple genes were related via interaction with "RNA polymerase II" (we suspect that, viewed this way, a great many genes are related via interaction with RNA polymerase!) -- not so helpful.
But more frequently, the analysis algorithm provides some gene-gene relationship that is irrelevant to our particular organ of interest... ("That's nice, but those genes are never expressed in the brain...") Even more frequently, relationships are identified between genes for which we have no data showing they are ever expressed in the same cell type at the same time in development. This results in a large amount of manual curation, and we're left wondering what part of systems biology is actually automated! There are also the "academic" pathway analysis tools and resources, such as Pathway Commons, STRING, GeneMANIA, VisANT, and others -- but these have their own limitation: some true relationships that have been demonstrated and published by multiple labs simply do not show up. The bottom line -- there are a lot of false negatives and false positives with this in silico approach. We turn to the wet lab to sort out some of these leads, but we can't chase them all down, and we still need some way to identify the highest-yield genes for further validation experiments.
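To make the kind of curation we end up doing by hand concrete, here is a minimal sketch of how it might be automated. It assumes the gene-gene relationships have been exported from some tool as simple pairs, and that we have our own table of which genes are expressed in which (tissue, stage) context. The function name, the `max_degree` cutoff, and the gene and tissue labels are all hypothetical placeholders, not part of any existing tool.

```python
from collections import Counter

def filter_edges(edges, expressed, max_degree=50):
    """Keep only edges that are plausible in our biological context.

    edges:      list of (gene_a, gene_b) pairs exported from a pathway tool
    expressed:  dict mapping gene -> set of (tissue, stage) contexts in
                which we have evidence that the gene is expressed
    max_degree: hypothetical cutoff; genes with more partners than this are
                treated as uninformative hubs (the "RNA polymerase II" problem)
    """
    # Count how many edges each gene takes part in.
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1

    kept = []
    for a, b in edges:
        # Drop edges that touch a promiscuous hub.
        if degree[a] > max_degree or degree[b] > max_degree:
            continue
        # Require both genes to be expressed in at least one shared context.
        shared = expressed.get(a, set()) & expressed.get(b, set())
        if shared:
            kept.append((a, b, sorted(shared)))
    return kept

# Toy data with made-up gene names.
edges = [
    ("GENE_A", "GENE_B"),
    ("GENE_A", "HUB"),
    ("GENE_B", "HUB"),
    ("GENE_C", "HUB"),
    ("GENE_D", "HUB"),
    ("GENE_A", "GENE_D"),
]
expressed = {
    "GENE_A": {("cortex", "E14")},
    "GENE_B": {("cortex", "E14"), ("cortex", "P0")},
    "GENE_C": {("cortex", "E14")},
    "GENE_D": {("liver", "E14")},
    "HUB": {("cortex", "E14")},
}

for edge in filter_edges(edges, expressed, max_degree=3):
    print(edge)
```

In this toy example only the GENE_A/GENE_B edge survives: everything routed through the hub is dropped, and GENE_A/GENE_D is dropped because the two genes share no context. Note that a filter like this only addresses the false positives; it cannot recover published relationships that are missing from the input.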
How are labs dealing with the lack of biological context data in current pathway analysis algorithms? Are there other tools that people have developed internally?
We have been left with the conclusion that the best next step is to build databases of organ-specific (and cell-type-specific) gene expression datasets from embryonic knock-downs in model animal systems, and then to build a pathway analysis algorithm from the bottom up that is more sensitive to biological context. This should improve both specificity and sensitivity. I am starting to think that any pathway analysis algorithm that is not based on these kinds of organ-specific and timepoint-specific data is not so useful. Any other ideas?
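As a rough illustration of what a context-aware step might do with such a database, the sketch below ranks candidate genes for wet-lab follow-up by how many of our starting genes they are linked to, using only edges already restricted to the organ and timepoint of interest. This is just one simple scoring idea under those assumptions, not an established algorithm; `rank_candidates` and the gene labels are hypothetical.

```python
from collections import defaultdict

def rank_candidates(seed_genes, context_edges):
    """Rank non-seed genes by the number of distinct seed genes they touch.

    seed_genes:    genes from our own list (differentially expressed genes,
                   or genes inside the copy number variants)
    context_edges: (gene_a, gene_b) pairs assumed to be already restricted
                   to the organ / cell type / timepoint of interest
    """
    seeds = set(seed_genes)
    partners = defaultdict(set)
    for a, b in context_edges:
        # Only edges linking a seed gene to a non-seed gene add to a score.
        if a in seeds and b not in seeds:
            partners[b].add(a)
        elif b in seeds and a not in seeds:
            partners[a].add(b)
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(
        ((gene, len(linked)) for gene, linked in partners.items()),
        key=lambda item: (-item[1], item[0]),
    )

# Toy data with made-up gene names.
seeds = ["S1", "S2", "S3"]
context_edges = [("S1", "X"), ("X", "S2"), ("S3", "Y"), ("S1", "S2"), ("Y", "Z")]
print(rank_candidates(seeds, context_edges))
```

Here X is linked to two seed genes and Y to one, so X would be the first candidate to validate; Z is ignored because it has no direct link to any seed gene.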
Edit: Larry Parnell mentions DAPPLE in his answer below, so I've added a link to their site.