Question

Using a statistical enrichment test for groups in Principal Component Analysis

3.0 years ago

Gabriel ▴ 150

Hello.

I am analyzing a set of samples from different experimental groups, and I use Principal Component Analysis to separate the groups visually. However, I notice that certain groups have the highest scores on certain principal components, e.g. PC1, and I would like to show this statistically.

Assume I have a factor for experimental groups:

[ A A A A A A B B B B B B C C C C C C C C C C C C ]

And want to correlate them to PC1, which has some scores...

[ -0.12 -0.52 -0.12 ... etc ... 0.64 0.11 0.69 0.33 ]

As can be seen, group "C" has higher scores. What is the most appropriate, or most commonly used, statistical test to show this?

Currently, I have tried a simple Pearson correlation, coding group C as 1 and all other groups as 0. However, this is not ideal if there is a lot of variance between groups.

I also thought of doing logistic regression, I tried it but it fails if the groups are perfectly separated and isn't really useful for small sample sizes.

So I am going to z-score the PC scores and then run a Welch's t-test to obtain the p-value (if the number of samples were the same in each group, it could be a paired t-test). However, I didn't really find any examples online of others doing the same; maybe this https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015128/ but nothing clear.

Am I justified in using such a statistical test, or should I take another approach?
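For reference, the z-score plus Welch's t-test plan described above could be sketched roughly as follows. The scores here are simulated and purely hypothetical (the real PC1 values are elided in the post), and the names `group`, `pc1` and `is_c` are made up for illustration. Two things the sketch is meant to show: z-scoring all scores first is a linear rescaling and leaves the t-test result unchanged, and the test is run unpaired, since a paired t-test needs matched samples rather than merely equal group sizes.

```r
# Hypothetical PC1 scores for 6 A, 6 B and 12 C samples.
# Simulated for illustration only; NOT the poster's data.
set.seed(1)
group <- factor(rep(c("A", "B", "C"), times = c(6, 6, 12)))
pc1 <- c(rnorm(12, mean = -0.2, sd = 0.2),  # A and B samples
         rnorm(12, mean = 0.4, sd = 0.2))   # C samples

is_c <- group == "C"

# Welch's t-test: t.test() does not assume equal variances by default
welch_raw <- t.test(pc1[is_c], pc1[!is_c])

# z-score all scores together, then repeat the same test
pc1_z <- as.numeric(scale(pc1))
welch_z <- t.test(pc1_z[is_c], pc1_z[!is_c])

# The two p-values agree, so the z-scoring step is optional
welch_raw$p.value
all.equal(welch_raw$p.value, welch_z$p.value)
```

The same pattern can be repeated for each PC and each group of interest, in which case the resulting p-values would need a multiple-testing correction such as `p.adjust()`.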

PCA • 1.3k views


I'm not sure what you mean by "I also thought of doing logistic regression, I tried it but it fails if the groups are perfectly separated and isn't really useful for small sample sizes." Logistic regression should work, and your groups being perfectly separated should make it work even better.

ADD REPLY • link 3.0 years ago by i.sudbery 19k

Hello, it should, but there is a problem with the logistic regression fitting algorithm: it does not seem to converge if the groups are perfectly separated. See the regression in this image: https://files.catbox.moe/kyjhk5.png

It looks good; however, I am getting warning messages:

fit = glm(admit ~ vektorr, data=mydata2, family=binomial)

Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred

My data looks like this

mydata2[["admit"]] 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

round( mydata2[["vektorr"]] , digits=2) 0.33 0.27 0.19 0.28 0.37 0.24 0.04 0.02 -0.08 0.04 -0.07 -0.01 0.06 0.00 -0.03 0.03 -0.01 -0.16 -0.31 -0.28 -0.36 -0.29 -0.27

You can try it for yourself and see what happens.

It is not just me who has had this problem, see:

https://stats.stackexchange.com/questions/254124/why-does-logistic-regression-become-unstable-when-classes-are-well-separated

https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression
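A quick way to confirm that this is complete separation is to compare the lowest score in the `admit == 1` group with the highest score in the `admit == 0` group. The sketch below uses the rounded values posted above, so the fit may differ slightly from the one on the unrounded data; `glm_warnings` and the other helper names are made up for illustration. When the groups do not overlap at all, no finite maximum likelihood estimate of the slope exists, which is why `glm()` complains.

```r
# Data as posted in this thread (values rounded to 2 digits)
admit <- c(rep(1, 6), rep(0, 17))
vektorr <- c(0.33, 0.27, 0.19, 0.28, 0.37, 0.24,
             0.04, 0.02, -0.08, 0.04, -0.07, -0.01, 0.06, 0.00, -0.03,
             0.03, -0.01, -0.16, -0.31, -0.28, -0.36, -0.29, -0.27)
stopifnot(length(admit) == length(vektorr))

# Complete separation: every admit == 1 score lies above every admit == 0 score
lowest_admitted <- min(vektorr[admit == 1])
highest_other <- max(vektorr[admit == 0])
completely_separated <- lowest_admitted > highest_other

# Fit the same model, collecting the warnings instead of printing them
glm_warnings <- character(0)
fit <- withCallingHandlers(
  glm(admit ~ vektorr, family = binomial),
  warning = function(w) {
    glm_warnings <<- c(glm_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

completely_separated
glm_warnings
coef(fit)  # the slope is very large because the likelihood keeps improving
```

The warnings here are a symptom of the data being too cleanly split for this model, not a bug in the fitting code.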

ADD REPLY • link 3.0 years ago by Gabriel ▴ 150

score 0 · Answer 1 · 2021-03-31

3.0 years ago

jared.andrews07 ★ 16k

What's your end goal here? While this is technically possible, it may be pretty unnecessary. If you really want to do so, you can try a regression model to classify your conditions based on PC1. See this answer by @kevin. Alternatively, you can just use something like pvclust to cluster your samples and show p-values for each sub-tree.


My end goal is to detect which PCs are correlated with, or enriched in, specific experimental groups.

I am already trying the logistic regression model, but it seems to fail to converge when there is perfect separation of the groups; see my response to @i.sudbery.

ADD REPLY • link 3.0 years ago by Gabriel ▴ 150

score 0 · Answer 2 · 2021-04-01

If you want a test, I would recommend Mann-Whitney rather than a t-test, as you have no reason to believe that the PC scores for each sample are normally distributed. It is true that a non-parametric test is less powerful than something like a t-test, but this lack of power reflects a real property of your data: with smaller samples, more extreme results are more likely to emerge by chance.
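As a rough sketch of that suggestion, the Mann-Whitney test is available in base R as `wilcox.test()`. The example below applies it to the rounded scores posted earlier in this thread, with the six `admit == 1` samples as one group and the remaining seventeen as the other; the variable names are made up for illustration. Because rounding introduces tied values, an exact p-value cannot be computed, so the normal approximation is requested explicitly.

```r
# Rounded scores posted earlier in the thread, split by group
scores_group1 <- c(0.33, 0.27, 0.19, 0.28, 0.37, 0.24)
scores_others <- c(0.04, 0.02, -0.08, 0.04, -0.07, -0.01, 0.06, 0.00, -0.03,
                   0.03, -0.01, -0.16, -0.31, -0.28, -0.36, -0.29, -0.27)

# Two-sided Mann-Whitney (Wilcoxon rank-sum) test.
# exact = FALSE: the rounded data contain ties, so use the normal approximation.
mw <- wilcox.test(scores_group1, scores_others, exact = FALSE)

mw$statistic  # W counts the (group1, others) pairs where group1 is larger
mw$p.value
```

Unlike logistic regression, the rank-based test has no trouble with perfectly separated groups: in that case W simply reaches its maximum of 6 * 17 pairs.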