is it meaningless for comparing averages from data with different sample size?
Entering edit mode
4.9 years ago
boaty ▴ 200

Hi guys,

I am working on counting motifs for a set of sequence. Now I generated this kind of table, RNA motif associated with FC

      motif_count             fold_change
RNA1      15                       0.8
RNA2      24                       4.5
RNA3      1                       -1.2
RNA1000   10                       0.3

My goal is to find if there is a positive relationship between motif count and FC: more count means more RNA enriched? I plotted a figure: plot( mean(fold change) , Motif_count(i = 1:24) ) which perfectly accords with our hypotheses.

But I am worried a lot about the huge difference in sample size. For example:

If motif count == 1, I have 4500 RNA and their FC varies from -4 to 4, mean == -0.8

But for motif count == 15, I only have 35 RNA, FC from 1.5 to 2.5, mean == 2

So there is a huge difference in sample size and comparing their means, it feels so wrong.......

here comes my question, can I use this data to confirm my hypotheses of positive relationship between count and FC? thanks a lot

RNA-Seq motif finding enrichment R statisctics • 2.8k views
Entering edit mode
4.9 years ago
mmfansler ▴ 410

You can definitely use this kind of data to test the hypothesis that a positive correlation exists, but you need to perform a statistical analysis that uses all the data points and not the means directly. Here are some options I can think of:

Linear model

On the face of it, one might think that a linear regression should suffice:

# R
myLinearModel <- lm(fold_change ~ motif_count, data = df)
# Coefficients ... etc.
# F-stat: , p-value: 0.0001

However, since you know you have some groups with a small number of samples (e.g., only one with 24 occurrences of the motif), the burden is going to fall on you to prove that none of the points have a disproportionate influence on the regression coefficient(s). So, you would additionally have to do some leverage or influence analysis on your resulting model. For example, checking that you don't have any high Cook's distances:

# do I interpret these again? something about 3...


Even if that works out, there are two potentially problematic issues:

  1. the assumption of continuity, when we really have discrete counts - maybe not so big a deal though; and
  2. the assumption of linearity, when we probably doubt that adding 5 motifs to a sequence already containing 20 will have the same effect as adding 5 to a one with only 2.

It might be more useful to bin the motif counts into levels, like "low", "medium", and "high", and do something similar with the fold change ("down", "neutral", "up"). You could then use chisq.test to test for independence (null hypothesis).

If you had some idea for how to split this prior to looking at the data, that would be ideal - but you've already peeked at the data, which means you need to be careful about making biased choices in your analysis. Hand-picking bins at this point could be construed as cherry-picking your statistical test.


Another option is to do an ordinal ANOVA, such as provided by the ordAOV method in the ordPens package. In this approach, you won't make any assumption about the scale of effect size difference between your different motif count groups, and you'll also control for the variance within each group. To do this, you instead would use a motif count rank, in place of the motif count itself. Here's what the test would look like in R:


rankedMC <- factor(df$motif_count, ordered = TRUE)
levels(rankedMC) <- seq_along(levels(rankedMC))
rankedMC <- as.numeric(rankedMC)

ordAOV(rankedMC, df$fold_change)
# Test stat = ..., p-value = 0.0005

Another nice thing here is that the test is based on simulations from the empirical distribution, so you don't have pesky parameters or distribution assumptions to fret over.

Entering edit mode

![enter image description here][1]thank your mmfansler,

the linear model was the first thing i thought to find out the relationship between count and FC, but i give up after plot all points on the same figure, it is not linear at all. Your suggestion about cocks.distance() is great, at least gives me some ideas about these outliers.

binning and ordering also help me understanding what the my data is. Thanks a lot


Login before adding your answer.

Traffic: 2120 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6