In GSEA, how can one "correlate" gene expression with categorical phenotype data?
1
0
Entering edit mode
4.6 years ago
O.rka ▴ 660

The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic (ref. 7 and Fig. 1B). http://www.pnas.org/content/102/43/15545

How does one calculate the correlation between gene expression (continuous values) and categorical phenotype data (strings or binary encoded data)?

For example, suppose one had the following data:

gene_expr_A = [0.3, 0.5, 0.8, 13.0, 12.3, 15.8]
phenotypes = ["healthy", "healthy", "healthy", "diseased", "diseased", "diseased"]


Would this correlation be calculated like this?

phenotypes_encoded = [0,0,0,1,1,1]
correlation = pearson(gene_expr_A, phenotypes_encoded)


Is this statistically robust? I feel like this oversimplifies the operations.

gsea correlation pearson genes expression • 2.3k views
3
Entering edit mode
4.6 years ago

Hey,

Most things are simple when broken down and understood. I think that it is reasonable to correlate a continuous variable to the numerical equivalent of a categorical variable. This is what happens with a linear regression, after all. Take a look:

continuous <- c(45, 67, 12, 65, 75, 3, 44, 90)
categorical <- factor(c(0,0,0,0,1,1,1,1))

cor(continuous, as.numeric(categorical)) ^ 2
 0.01024737

summary(lm(continuous ~ categorical))\$r.squared
 0.01024737


Between both of these, I would feel more comfortable reporting the regression values along with the p-value from the regression fit. Be careful on the interpretation of what this 'correlation' means, too.

Kevin

0
Entering edit mode

Hey Kevein, thanks this is making a lot of sense once I plotted data. I don't know why but it always seemed incorrect to look at correlations in this way but you're right and I definitely get it now.

Also, your linear model at the end is interesting. I'm pretty new to linear models used in this way so please let me know if I understand this correctly.

y = beta*x + bias_constant + epsilon

where y is the gene expression value from continuous, beta is the coefficient multiplied against x which is either 0 or 1 depending on the phenotype, bias_constant is the y intercept, and episilon is some normally distributed error. The fit for the model measured by R^2 is the pearson correlation between the 2 vectors squared?

I've seen R^2 that are between -1 and 1. How would the negative R^2 values be computed in this way?

Thanks again.

1
Entering edit mode

The negative r-squared is explained very well here: https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative

For your other questions, I also point you to other material:

[Biostars is more for general bioinformatics, not statistics]

Remember, of course, that cor() and lm() will only produce the same value in a select few cases.

1
Entering edit mode

I will keep that in mind for the future. Thanks for the help even though this was out of scope. These links are really useful.