Question: is it meaningless for comparing averages from data with different sample size?
1
14 months ago by
boaty60
boaty60 wrote:

Hi guys,

I am working on counting motifs for a set of sequence. Now I generated this kind of table, RNA motif associated with FC

``````      motif_count             fold_change
RNA1      15                       0.8
RNA2      24                       4.5
RNA3      1                       -1.2
...
RNA1000   10                       0.3
``````

My goal is to find if there is a positive relationship between motif count and FC: more count means more RNA enriched? I plotted a figure: plot( mean(fold change) , Motif_count(i = 1:24) ) which perfectly accords with our hypotheses.

But I am worried a lot about the huge difference in sample size. For example:

If motif count == 1, I have 4500 RNA and their FC varies from -4 to 4, mean == -0.8

But for motif count == 15, I only have 35 RNA, FC from 1.5 to 2.5, mean == 2

So there is a huge difference in sample size and comparing their means, it feels so wrong.......

here comes my question, can I use this data to confirm my hypotheses of positive relationship between count and FC? thanks a lot

modified 14 months ago by mmfansler270 • written 14 months ago by boaty60
5
14 months ago by
mmfansler270
MSKCC | New York, NY
mmfansler270 wrote:

You can definitely use this kind of data to test the hypothesis that a positive correlation exists, but you need to perform a statistical analysis that uses all the data points and not the means directly. Here are some options I can think of:

Linear model

On the face of it, one might think that a linear regression should suffice:

``````# R
myLinearModel <- lm(fold_change ~ motif_count, data = df)
summary(myLinearModel)
# Coefficients ... etc.
# F-stat: , p-value: 0.0001
``````

However, since you know you have some groups with a small number of samples (e.g., only one with 24 occurrences of the motif), the burden is going to fall on you to prove that none of the points have a disproportionate influence on the regression coefficient(s). So, you would additionally have to do some leverage or influence analysis on your resulting model. For example, checking that you don't have any high Cook's distances:

``````cooks.distance(myLinearModel)
# wait...how do I interpret these again? something about 3...
``````

Binning

Even if that works out, there are two potentially problematic issues:

1. the assumption of continuity, when we really have discrete counts - maybe not so big a deal though; and
2. the assumption of linearity, when we probably doubt that adding 5 motifs to a sequence already containing 20 will have the same effect as adding 5 to a one with only 2.

It might be more useful to bin the motif counts into levels, like "low", "medium", and "high", and do something similar with the fold change ("down", "neutral", "up"). You could then use `chisq.test` to test for independence (null hypothesis).

If you had some idea for how to split this prior to looking at the data, that would be ideal - but you've already peeked at the data, which means you need to be careful about making biased choices in your analysis. Hand-picking bins at this point could be construed as cherry-picking your statistical test.

Ordering

Another option is to do an ordinal ANOVA, such as provided by the `ordAOV` method in the `ordPens` package. In this approach, you won't make any assumption about the scale of effect size difference between your different motif count groups, and you'll also control for the variance within each group. To do this, you instead would use a motif count rank, in place of the motif count itself. Here's what the test would look like in R:

``````library(ordPens)

rankedMC <- factor(df\$motif_count, ordered = TRUE)
levels(rankedMC) <- seq_along(levels(rankedMC))
rankedMC <- as.numeric(rankedMC)

ordAOV(rankedMC, df\$fold_change)
# Test stat = ..., p-value = 0.0005
``````

Another nice thing here is that the test is based on simulations from the empirical distribution, so you don't have pesky parameters or distribution assumptions to fret over.

![enter image description here][1]thank your mmfansler,

the linear model was the first thing i thought to find out the relationship between count and FC, but i give up after plot all points on the same figure, it is not linear at all. Your suggestion about cocks.distance() is great, at least gives me some ideas about these outliers.

binning and ordering also help me understanding what the my data is. Thanks a lot