How to check if a certain phenotype is over-represented in a cluster
2.1 years ago
nhaus ▴ 310

Hello,

I am currently analyzing bulk RNA-seq data and have clustered my patients into 3 different clusters based on how similar their transcriptomic profiles are. For all my samples, I have many different phenotypic labels.

My goal now is to check whether one of the identified clusters is enriched for a certain phenotype (for example, being healthy).

My initial idea was to do a simple Fisher's exact test.

As a concrete example, imagine the following scenario:

I have identified 4 different clusters with different numbers of samples in each:

Cluster    Number of samples
1          41
2          32
3          29
4          26

I am interested in whether cluster 1 is enriched for healthy samples. I checked and 13 of the 41 samples in cluster 1 are healthy patients; the rest (28) are unhealthy. In the 3 other clusters combined, there are 10 healthy samples and 77 unhealthy samples. Consequently, if I understand everything correctly, the contingency table for my Fisher's exact test should look something like this:

                  Healthy    Unhealthy
Cluster 1         13         28
Other clusters    10         77

If I want to test for enrichment, I call fisher.test(contingency_table, alternative="greater"). If I want to test for depletion instead, I use alternative="less".
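A minimal sketch of these calls in R, using the counts from the example above. The name contingency_table matches the call in the text; the row and column labels and the names enrichment and depletion are only illustrative and are not required by fisher.test.

```r
# 2x2 table: rows = cluster membership, columns = health status.
# byrow = TRUE so the values can be typed in the same layout as the table above.
contingency_table <- matrix(
  c(13, 28,
    10, 77),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    cluster = c("cluster1", "other_clusters"),
    status  = c("healthy", "unhealthy")
  )
)

# One-sided test for enrichment of healthy samples in cluster 1
# (alternative hypothesis: odds ratio > 1).
enrichment <- fisher.test(contingency_table, alternative = "greater")

# One-sided test for depletion of healthy samples in cluster 1
# (alternative hypothesis: odds ratio < 1).
depletion <- fisher.test(contingency_table, alternative = "less")

print(contingency_table)
print(c(enrichment = enrichment$p.value, depletion = depletion$p.value))
```

With this layout, "healthy in cluster 1" sits in the top-left cell, so alternative = "greater" corresponds to enrichment. Swapping rows or columns would flip the direction.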

I would very much appreciate it if someone could confirm whether this is indeed the way to go, or whether there are more sophisticated and suitable approaches.

cluster overrepresentation • 502 views
2.1 years ago

Let's call P1 the proportion of healthy samples in cluster 1 and P0 the proportion of healthy samples in the rest. The null hypothesis being tested is that P1 = P0. The two-sided test asks whether there is evidence that P1 != P0 in either direction, i.e. it looks at both tails of the null distribution. With alternative = 'greater', you only consider the tail in the direction corresponding to P1 > P0; conversely, with alternative = 'less', you only consider the tail in the direction of P1 < P0.

However, before using a one-sided test, consider the cost of missing an effect in the other direction: do you only care about P1 > P0, or is P1 < P0 also of interest? As another example, imagine you're testing the effect of a treatment on patients. You want to know whether the treatment is effective (e.g. P1 > P0), so you might think of using alternative = 'greater', but then you would not detect a detrimental treatment (P1 < P0).

Most of the time we don't know a priori which direction to expect, so I would recommend a two-sided test. Only do a one-sided test if the untested direction is irrelevant to the question you're interested in. For example, if the conclusions you would draw are the same whether P1 = P0 or P1 < P0, use alternative = 'greater'; otherwise use the default alternative = 'two.sided'.
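To make the "tails" concrete, here is a small R sketch with the counts from the question. The variable names (tab, p_two_sided, and so on) are only illustrative. It assumes the usual setup for Fisher's exact test, where the table margins are treated as fixed, so that under the null the number of healthy samples in cluster 1 follows a hypergeometric distribution.

```r
# Counts from the question: rows = cluster 1 / other clusters,
# columns = healthy / unhealthy.
tab <- matrix(c(13, 28,
                10, 77),
              nrow = 2, byrow = TRUE,
              dimnames = list(cluster = c("cluster1", "others"),
                              status  = c("healthy", "unhealthy")))

# Default two-sided test: evidence for P1 != P0 in either direction.
p_two_sided <- fisher.test(tab)$p.value

# One-sided tests: each uses a single tail of the null distribution.
p_greater <- fisher.test(tab, alternative = "greater")$p.value
p_less    <- fisher.test(tab, alternative = "less")$p.value

# The "greater" p-value is the hypergeometric upper tail P(X >= 13),
# where X is the number of healthy samples among the 41 in cluster 1.
n_healthy   <- sum(tab[, "healthy"])    # 23
n_unhealthy <- sum(tab[, "unhealthy"])  # 105
n_cluster1  <- sum(tab["cluster1", ])   # 41
p_upper_tail <- phyper(tab["cluster1", "healthy"] - 1,
                       n_healthy, n_unhealthy, n_cluster1,
                       lower.tail = FALSE)

print(c(two.sided = p_two_sided, greater = p_greater,
        less = p_less, upper.tail = p_upper_tail))
```

The two one-sided p-values sum to more than 1 because both tails include the observed table, which is one reason to decide on the direction before looking at the data and not simply report whichever tail is smaller.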
