The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic (ref. 7 and Fig. 1B).
Source: http://www.pnas.org/content/102/43/15545
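For context, here is how I read the random walk described in the quote, as a minimal sketch rather than the paper's reference implementation. The function name `enrichment_score`, its parameters, and the exponent `p` are my own naming. It assumes the genes are already ranked by their correlation with the phenotype, and that the gene set is neither empty nor the whole list.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Running-sum enrichment score over a ranked gene list (sketch).

    ranked_genes: gene names, sorted by correlation with the phenotype
    correlations: the matching correlation values, in the same order
    gene_set:     set of gene names whose enrichment is being tested
    p:            weight exponent; p=0 makes every hit step the same size
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    # Normaliser for the hit steps: total weighted correlation of set members.
    # Assumes at least one set member has a non-zero correlation.
    n_r = sum(abs(r) ** p for r, h in zip(correlations, hits) if h)

    running, best = 0.0, 0.0
    for r, h in zip(correlations, hits):
        if h:
            # Step up; the step size grows with the gene's correlation.
            running += abs(r) ** p / n_r
        else:
            # Step down by a fixed amount for genes outside the set.
            running -= 1.0 / n_miss
        # Track the largest deviation from zero, keeping its sign.
        if abs(running) > abs(best):
            best = running
    return best
```

The step up is the only place the per-gene correlation enters, which is why I want to understand how that correlation is obtained.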
How does one calculate the correlation between gene expression (continuous values) and a categorical phenotype (strings or binary-encoded labels)?
For example, suppose one had the following data:
gene_expr_A = [0.3, 0.5, 0.8, 13.0, 12.3, 15.8]
phenotypes = ["healthy", "healthy", "healthy", "diseased", "diseased", "diseased"]
Would this correlation be calculated like this?
phenotypes_encoded = [0, 0, 0, 1, 1, 1]
correlation = pearson(gene_expr_A, phenotypes_encoded)
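Spelled out as runnable Python, the calculation I have in mind looks like the sketch below. The `pearson` helper is written by hand here purely for illustration and uses only the standard library. The 0/1 encoding (healthy = 0, diseased = 1) is my own arbitrary choice. As far as I can tell, Pearson's r computed against a two-level variable like this is what is usually called the point-biserial correlation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

gene_expr_A = [0.3, 0.5, 0.8, 13.0, 12.3, 15.8]
phenotypes = ["healthy", "healthy", "healthy",
              "diseased", "diseased", "diseased"]

# Arbitrary encoding: healthy -> 0, diseased -> 1.
phenotypes_encoded = [1 if p == "diseased" else 0 for p in phenotypes]

correlation = pearson(gene_expr_A, phenotypes_encoded)
print(correlation)
```

Swapping the encoding (healthy = 1, diseased = 0) only flips the sign of the result, so the choice of labels affects the direction of the correlation but not its magnitude.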
Is this statistically robust? I feel like this oversimplifies the operations.