Question

How to check if a certain phenotype is over-represented in a cluster

3.6 years ago

nhaus ▴ 420

Hello,

I am currently analyzing bulk RNA-seq data and have clustered my patients into 3 different clusters based on the similarity of their transcriptomic profiles. For all my samples, I have many different phenotypic labels.

My goal now is to check whether one of the identified clusters is enriched for a certain phenotype (for example, being healthy).

My initial idea was to do a simple Fisher test.

As a very concrete example, imagine the following scenario:

I have identified 4 different clusters with different numbers of samples in each:

Cluster	Number of samples
1	41
2	32
3	29
4	26

I am interested in whether Cluster 1 is enriched for healthy samples. I checked and 13 of the 41 samples in cluster 1 are from healthy patients; the rest (28) are unhealthy. In the 3 other clusters combined, there are 10 healthy samples and 77 unhealthy samples. Consequently, if I understand everything correctly, the contingency table for my Fisher test should look something like this:


Group	Healthy	Unhealthy
Cluster 1	13	28
Other clusters	10	77

If I want to test for enrichment, I simply call fisher.test(contingency_table, alternative="greater"). On the other hand, if I want to test for depletion, I use alternative="less".

I would very much appreciate it if someone could confirm whether this is indeed the way to go, or whether there are more sophisticated and suitable approaches.
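For reference, the two calls described above could be written out roughly as follows. This is a minimal sketch using the counts from the table; the row and column labels and the variable names `enrichment` and `depletion` are illustrative assumptions.

```r
# Counts taken from the question: rows are groups, columns are health status.
contingency_table <- matrix(
  c(13, 28,
    10, 77),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    group  = c("cluster_1", "other_clusters"),  # labels are illustrative
    status = c("healthy", "unhealthy")
  )
)

# One-sided test for enrichment of healthy samples in cluster 1
# (alternative hypothesis: odds ratio > 1).
enrichment <- fisher.test(contingency_table, alternative = "greater")

# One-sided test for depletion (alternative hypothesis: odds ratio < 1).
depletion <- fisher.test(contingency_table, alternative = "less")

enrichment$p.value
enrichment$estimate  # conditional maximum-likelihood estimate of the odds ratio
depletion$p.value
```

With this layout (cluster of interest in the first row, phenotype of interest in the first column), an odds ratio above 1 corresponds to a higher proportion of healthy samples in cluster 1 than in the other clusters.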

cluster overrepresentation • 1.0k views

updated 3.6 years ago by Jean-Karim Heriche 27k • written 3.6 years ago by nhaus ▴ 420

score 1 · Answer 1 · 2022-03-21

Let's call P1 the proportion of healthy samples in cluster 1 and P0 the proportion of healthy samples in the rest. The null hypothesis being tested is that P1=P0. The two-sided test asks whether the data are compatible with P1=P0 against the alternative P1!=P0, i.e. it looks at both tails of the null distribution. With alternative = 'greater', you only consider the tail in the direction corresponding to P1>P0 and, conversely, with alternative = 'less', you only look at the tail in the direction of P1<P0.

However, before using a one-sided test, consider the cost of missing an effect in the other direction, i.e. do you only care about P1>P0 or is P1<P0 also of interest? As another example, imagine you're testing the effect of a treatment on patients. You want to know if the treatment is effective (e.g. P1>P0), so you may think of using alternative = 'greater', but in this case you wouldn't detect whether the treatment is detrimental (P1<P0).

Most of the time we don't know a priori which direction to expect, so I would recommend a two-sided test. Only do a one-sided test if the untested direction is irrelevant to the question you're interested in. For example, if the conclusions you would draw are the same whether P1=P0 or P1<P0, then use alternative = 'greater'; otherwise use the default alternative = 'two.sided'.
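To make the comparison concrete, here is a rough sketch that computes the observed P1 and P0 from the counts in the question and runs Fisher's exact test with each of the three alternatives. The object names (`tab`, `p1`, `p0`, `p_values`) and the row/column labels are assumptions chosen for illustration.

```r
# Counts from the question: cluster 1 vs. the rest, healthy vs. unhealthy.
tab <- matrix(c(13, 28,
                10, 77),
              nrow = 2, byrow = TRUE,
              dimnames = list(group  = c("cluster_1", "rest"),
                              status = c("healthy", "unhealthy")))

# Observed proportions of healthy samples: P1 (cluster 1) and P0 (rest).
p1 <- tab["cluster_1", "healthy"] / sum(tab["cluster_1", ])  # 13/41
p0 <- tab["rest", "healthy"] / sum(tab["rest", ])            # 10/87

# Run the same test under each alternative and collect the p-values.
alternatives <- c("two.sided", "greater", "less")
p_values <- sapply(alternatives, function(alt) {
  fisher.test(tab, alternative = alt)$p.value
})

round(c(P1 = p1, P0 = p0), 3)
p_values
```

Here the observed P1 is larger than P0, so the 'greater' p-value is no larger than the two-sided one, while the 'less' p-value is close to 1. Which of these you report should be decided from the question you are asking, before looking at the data.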