Question: In GSEA, how can one "correlate" gene expression with categorical phenotype data?
gravatar for O.rka
18 months ago by
O.rka170 wrote:

The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic (ref. 7 and Fig. 1B).

How does one calculate the correlation between gene expression (continuous values) and categorical phenotype data (strings or binary encoded data)?

For example, suppose one had the following data:

gene_expr_A = [0.3, 0.5, 0.8, 13.0, 12.3, 15.8]
phenotypes = ["healthy", "healthy", "healthy", "diseased", "diseased", "diseased"]

Would this correlation be calculated like this?

phenotypes_encoded = [0,0,0,1,1,1]
correlation = pearson(gene_expr_A, phenotypes_encoded)

Is this statistically robust? I feel like this oversimplifies the operations.

ADD COMMENTlink modified 18 months ago by Kevin Blighe60k • written 18 months ago by O.rka170
gravatar for Kevin Blighe
18 months ago by
Kevin Blighe60k
Kevin Blighe60k wrote:


Most things are simple when broken down and understood. I think that it is reasonable to correlate a continuous variable to the numerical equivalent of a categorical variable. This is what happens with a linear regression, after all. Take a look:

continuous <- c(45, 67, 12, 65, 75, 3, 44, 90)
categorical <- factor(c(0,0,0,0,1,1,1,1))

cor(continuous, as.numeric(categorical)) ^ 2
[1] 0.01024737

summary(lm(continuous ~ categorical))$r.squared
[1] 0.01024737

Between both of these, I would feel more comfortable reporting the regression values along with the p-value from the regression fit. Be careful on the interpretation of what this 'correlation' means, too.


ADD COMMENTlink written 18 months ago by Kevin Blighe60k

Hey Kevein, thanks this is making a lot of sense once I plotted data. I don't know why but it always seemed incorrect to look at correlations in this way but you're right and I definitely get it now.

Also, your linear model at the end is interesting. I'm pretty new to linear models used in this way so please let me know if I understand this correctly.

y = beta*x + bias_constant + epsilon

where y is the gene expression value from continuous, beta is the coefficient multiplied against x which is either 0 or 1 depending on the phenotype, bias_constant is the y intercept, and episilon is some normally distributed error. The fit for the model measured by R^2 is the pearson correlation between the 2 vectors squared?

I've seen R^2 that are between -1 and 1. How would the negative R^2 values be computed in this way?

Thanks again.

ADD REPLYlink written 18 months ago by O.rka170

The negative r-squared is explained very well here:

For your other questions, I also point you to other material:

[Biostars is more for general bioinformatics, not statistics]

Remember, of course, that cor() and lm() will only produce the same value in a select few cases.

ADD REPLYlink written 18 months ago by Kevin Blighe60k

I will keep that in mind for the future. Thanks for the help even though this was out of scope. These links are really useful.

ADD REPLYlink written 18 months ago by O.rka170
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 922 users visited in the last hour