The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic (ref. 7 and Fig. 1B). http://www.pnas.org/content/102/43/15545
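To make the quoted random walk concrete, here is a minimal sketch of a weighted running-sum enrichment score in plain Python. The function name `enrichment_score`, the toy gene names, and the toy correlation values are all made up for illustration; the weighting exponent `p` and the uniform step down for genes outside the set follow the usual description of the statistic, but this is not the reference implementation.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Maximum deviation from zero of a weighted running sum.

    ranked_genes: gene names sorted by correlation with the phenotype.
    correlations: correlation of each gene with the phenotype, same order.
    gene_set:     names of the genes in the set being tested.
    p:            weighting exponent (p=0 gives an unweighted KS-like walk).
    """
    gene_set = set(gene_set)
    # Step up by |r|^p for genes in the set (a "hit").
    hit_weights = [abs(r) ** p if g in gene_set else 0.0
                   for g, r in zip(ranked_genes, correlations)]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for g in ranked_genes if g not in gene_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, best = 0.0, 0.0
    for g, w in zip(ranked_genes, hit_weights):
        if g in gene_set:
            running += w / total_hit   # increment scaled by the correlation
        else:
            running -= 1.0 / n_miss    # uniform decrement for a miss
        if abs(running) > abs(best):
            best = running             # keep the largest deviation from zero
    return best


# Hypothetical toy data: six genes ranked by correlation with the phenotype.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
corr = [0.9, 0.7, 0.2, -0.1, -0.5, -0.8]

es_top = enrichment_score(genes, corr, {"g1", "g2"})
es_bottom = enrichment_score(genes, corr, {"g5", "g6"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking gives a positive score, and one concentrated at the bottom gives a negative score.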

**How does one calculate the correlation between gene expression (continuous values) and categorical phenotype data (strings or binary encoded data)?**

For example, suppose one had the following data:

```
gene_expr_A = [0.3, 0.5, 0.8, 13.0, 12.3, 15.8]
phenotypes = ["healthy", "healthy", "healthy", "diseased", "diseased", "diseased"]
```

Would this correlation be calculated like this?

```
phenotypes_encoded = [0,0,0,1,1,1]
correlation = pearson(gene_expr_A, phenotypes_encoded)
```

Is this statistically robust? I feel like this oversimplifies the operations.
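For reference, a Pearson correlation computed against a 0/1 encoding is the point-biserial correlation, so the idea above can be run as-is. The sketch below uses only the standard library; the helper name `pearson` and the choice to code "diseased" as 1 are assumptions, and swapping the coding only flips the sign.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


gene_expr_A = [0.3, 0.5, 0.8, 13.0, 12.3, 15.8]
phenotypes = ["healthy", "healthy", "healthy",
              "diseased", "diseased", "diseased"]

# Assumed coding: diseased = 1, healthy = 0.
phenotypes_encoded = [1 if p == "diseased" else 0 for p in phenotypes]

r = pearson(gene_expr_A, phenotypes_encoded)
# Reversing the coding (healthy = 1) gives the same magnitude, opposite sign.
r_flipped = pearson(gene_expr_A, [1 - v for v in phenotypes_encoded])
print(r, r_flipped)
```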

Hey Kevin, thanks, this made a lot of sense once I plotted the data. I don't know why, but it always seemed incorrect to look at correlations in this way. You're right, though, and I definitely get it now.

Also, your linear model at the end is interesting. I'm pretty new to linear models used in this way so please let me know if I understand this correctly.

y = beta*x + bias_constant + epsilon

where `y` is the gene expression value from `continuous`, `beta` is the coefficient multiplied against `x` (which is either 0 or 1 depending on the phenotype), `bias_constant` is the y-intercept, and `epsilon` is some normally distributed error. Is the fit of the model, as measured by R^2, the Pearson correlation between the 2 vectors squared? I've seen R^2 values between -1 and 1. How would negative R^2 values be computed in this way?
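As a quick check of that reading, here is a minimal sketch that fits `y = beta*x + bias_constant` by ordinary least squares on the example data from the question, with `x` coded 0 for healthy and 1 for diseased (an assumed coding). For a single predictor with an intercept, R^2 comes out equal to the squared Pearson correlation, `beta` is the difference between the two group means, and `bias_constant` is the mean of the group coded 0.

```python
gene_expr_A = [0.3, 0.5, 0.8, 13.0, 12.3, 15.8]
x = [0, 0, 0, 1, 1, 1]   # assumed coding: 0 = healthy, 1 = diseased
y = gene_expr_A

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)   # total sum of squares

# Ordinary least squares for one predictor plus an intercept.
beta = sxy / sxx
bias_constant = mean_y - beta * mean_x

predictions = [bias_constant + beta * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
r_squared = 1 - ss_res / syy

pearson_r = sxy / (sxx * syy) ** 0.5

mean_healthy = sum(yi for xi, yi in zip(x, y) if xi == 0) / x.count(0)
mean_diseased = sum(yi for xi, yi in zip(x, y) if xi == 1) / x.count(1)

print(beta, mean_diseased - mean_healthy)   # slope vs. difference in means
print(r_squared, pearson_r ** 2)            # R^2 vs. squared correlation
```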

Thanks again.

The negative r-squared is explained very well here: https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative
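As a small illustration of one way this can happen, the sketch below computes R^2 as `1 - SS_res / SS_tot` for a model forced through the origin. The numbers and variable names are invented for the example. When the predictions fit worse than simply predicting the mean of `y`, `SS_res` exceeds `SS_tot` and the value drops below zero; it is also not bounded below by -1.

```python
def r_squared(y, predictions):
    """R^2 defined as 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data with a large offset and a negative trend.
x = [3.0, 2.0, 1.0]
y = [10.0, 11.0, 12.0]

# Least-squares slope for a model with no intercept: y = beta * x.
beta_no_intercept = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
preds_no_intercept = [beta_no_intercept * xi for xi in x]

r2_bad = r_squared(y, preds_no_intercept)           # worse than the mean
r2_mean = r_squared(y, [sum(y) / len(y)] * len(y))  # predicting the mean

print(r2_bad, r2_mean)
```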

For your other questions, I also point you to other material:

[Biostars is more for general bioinformatics, not statistics]

Remember, of course, that `cor()` and `lm()` will only produce the same value in a select few cases.

I will keep that in mind for the future. Thanks for the help even though this was out of scope. These links are really useful.