This paper by Gibbons and Roth (2002) describes a method of validating clusters by checking their mutual information against GO terms. A cluster/annotation contingency matrix is produced in which, for cluster r (the row) and GO term c (the column), each element is the number of occurrences of that GO term among the genes in that cluster. Then, mutual information is calculated. This is best visualized from this graphic in Steuer et al. (2006).
My question is how to calculate this mutual information value. I have such a contingency matrix and know how to calculate mutual information in general, but Gibbons and Roth (and Steuer et al. too) use an approximation, and I'm unsure of their notation.
The MI for a cluster is additive under their (maybe too strong) assumptions:
I(C, [A1, A2]) = I(C, A1) + I(C, A2)
Each I(C, Ai) = H(C) + H(Ai) - H(C, Ai)
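To make the notation concrete, here is a minimal sketch of one way I could imagine reading this formula. It rests on an assumption I have not confirmed from either paper: that each Ai is a binary variable (a gene either has the GO term or does not), so the joint table for (C, Ai) has one row per cluster and two columns, "has term" and "lacks term". Under that reading I would also need the cluster sizes, which are not in the contingency matrix itself. The counts, the cluster sizes, and the function names below are all made up for illustration.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability cells are skipped."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mi_cluster_attribute(with_term, cluster_sizes):
    """I(C, Ai) = H(C) + H(Ai) - H(C, Ai), ASSUMING Ai is binary.

    with_term[r]     = genes in cluster r annotated with the term
    cluster_sizes[r] = total genes in cluster r (hypothetical input)
    """
    n = sum(cluster_sizes)
    # Joint distribution: one row per cluster, columns = (has term, lacks term)
    joint = [(k / n, (size - k) / n) for k, size in zip(with_term, cluster_sizes)]
    p_c = [sum(row) for row in joint]        # marginal over clusters
    p_a = [sum(col) for col in zip(*joint)]  # marginal over has/lacks
    h_ca = entropy(p for row in joint for p in row)
    return entropy(p_c) + entropy(p_a) - h_ca

# Made-up example: 2 clusters of 10 genes each, 2 GO terms.
cluster_sizes = [10, 10]
counts = [[10, 5],   # cluster 1: term 1 in all 10 genes, term 2 in 5
          [0, 5]]    # cluster 2: term 1 in none,         term 2 in 5

# One I(C, Ai) per column, then the additive total I(C, [A1, A2]).
mi_per_term = [mi_cluster_attribute(col, cluster_sizes) for col in zip(*counts)]
total_mi = sum(mi_per_term)
print(mi_per_term, total_mi)
```

With these made-up numbers, term 1 separates the two clusters perfectly and contributes 1 bit, while term 2 is spread evenly and contributes 0 bits, so the total is 1 bit. Whether this is actually what the papers intend is exactly what I'm unsure about.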
What I'm confused about is (1) Why is there no subscript on C? How is H(C, Ai) calculated when Ai corresponds to one column and C seems to correspond to all of the clusters? With only one column, how do we get the joint distribution? (2) How is H(C) calculated? Is it computed across all attributes?
If you want to show an example with the contingency table in the graphic, I'd be forever grateful!