You can definitely use this kind of data to test the hypothesis that a positive correlation exists, but you need to perform a statistical analysis that uses all the data points and not the means directly. Here are some options I can think of:
On the face of it, one might think that a linear regression should suffice:
myLinearModel <- lm(fold_change ~ motif_count, data = df)
summary(myLinearModel)
# Coefficients ... etc.
# F-stat: , p-value: 0.0001 [sweet!]
However, since you know you have some groups with a small number of samples (e.g., only one with 24 occurrences of the motif), the burden is going to fall on you to prove that none of the points have a disproportionate influence on the regression coefficient(s). So, you would additionally have to do some leverage or influence analysis on your resulting model. For example, checking that you don't have any high Cook's distances:
cooks.distance(myLinearModel)
# wait...how do I interpret these again? something about 3...
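To make the influence check concrete, here is a minimal sketch. The data frame is simulated as a stand-in for your `df` (including a lone point at 24 motifs), and the 4/n cut-off is an assumed rule of thumb, one of several in circulation, not a hard threshold.

```r
# Hypothetical stand-in for `df`: simulated motif counts and fold changes,
# with a single observation at 24 motifs.
set.seed(1)
df <- data.frame(motif_count = c(rep(0:5, each = 8), 24))
df$fold_change <- 1 + 0.1 * df$motif_count + rnorm(nrow(df), sd = 0.5)

myLinearModel <- lm(fold_change ~ motif_count, data = df)

# Cook's distance: one value per observation; larger means more influence.
cooksD <- cooks.distance(myLinearModel)

# Assumed rule of thumb: flag observations with Cook's distance above 4 / n.
threshold <- 4 / nrow(df)
influential <- which(cooksD > threshold)
print(df[influential, ])

# Refit without the flagged points to see how much the coefficients move.
refit <- if (length(influential) > 0) {
  lm(fold_change ~ motif_count, data = df[-influential, ])
} else {
  myLinearModel
}
print(rbind(all_points = coef(myLinearModel), without_flagged = coef(refit)))
```

If the slope barely changes after dropping the flagged points, the regression is probably not being driven by them.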
Even if that works out, there are two potential issues:
- the assumption of continuity, when we really have discrete counts - maybe not so big a deal though; and
- the assumption of linearity, when we probably doubt that adding 5 motifs to a sequence already containing 20 will have the same effect as adding 5 to one with only 2.
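One rough way to probe the linearity concern is to compare the straight-line fit against one with a quadratic term. This is only a sketch on simulated data with a made-up saturating effect; the quadratic term is my choice of alternative, not the only sensible one.

```r
# Hypothetical data: the effect of extra motifs flattens out at high counts.
set.seed(3)
df <- data.frame(motif_count = rep(c(0, 1, 2, 5, 10, 20), each = 10))
df$fold_change <- log1p(df$motif_count) + rnorm(nrow(df), sd = 0.4)

linearFit    <- lm(fold_change ~ motif_count, data = df)
quadraticFit <- lm(fold_change ~ motif_count + I(motif_count^2), data = df)

# Nested-model F test: a small p-value suggests a straight line is not enough.
curvatureTest <- anova(linearFit, quadraticFit)
print(curvatureTest)
```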
It might be more useful to bin the motif counts into levels, like "low", "medium", and "high", and do something similar with the fold change ("down", "neutral", "up"). You could then use chisq.test to test for independence (the null hypothesis).
If you had some idea for how to split this prior to looking at the data, that would be ideal - but you've already peeked at the data, which means you need to be careful about making biased choices in your analysis. Hand-picking bins at this point could be construed as cherry-picking your statistical test.
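For illustration, the binning approach might look like the sketch below. The data are simulated and every cut-off is hypothetical; in a real analysis the breaks should be fixed on outside grounds before looking at the data. I use simulate.p.value = TRUE because some cells may have small expected counts.

```r
# Hypothetical stand-in for `df`; fold_change is treated as centred around 0 here.
set.seed(2)
df <- data.frame(motif_count = rpois(120, lambda = 4))
df$fold_change <- 0.15 * (df$motif_count - 4) + rnorm(nrow(df))

# Pre-specified (here: made-up) bin edges for both variables.
motifLevel <- cut(df$motif_count, breaks = c(-Inf, 2, 5, Inf),
                  labels = c("low", "medium", "high"))
changeLevel <- cut(df$fold_change, breaks = c(-Inf, -0.5, 0.5, Inf),
                   labels = c("down", "neutral", "up"))

# 3 x 3 contingency table of motif level against direction of change.
binTable <- table(motifLevel, changeLevel)
print(binTable)

# Null hypothesis: motif level and direction of change are independent.
binTest <- chisq.test(binTable, simulate.p.value = TRUE, B = 2000)
print(binTest)
```

Note that this test ignores the ordering of the bins, so it only detects association, not a specifically positive trend.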
Another option is to do an ordinal ANOVA, such as provided by the ordAOV method in the ordPens package. In this approach, you won't make any assumption about the scale of effect size difference between your different motif count groups, and you'll also control for the variance within each group. To do this, you instead would use a motif count rank, in place of the motif count itself. Here's what the test would look like in R:
rankedMC <- factor(df$motif_count, ordered = TRUE)
levels(rankedMC) <- seq_along(levels(rankedMC))
rankedMC <- as.numeric(rankedMC)
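library(ordPens)
# Arguments as I understand them from ?ordAOV; adjust nsim to taste
ordAOV(rankedMC, df$fold_change, type = "RLRT", nsim = 10000)
# Test stat = ..., p-value = 0.0005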
Another nice thing here is that the test is based on simulations from the empirical distribution, so you don't have pesky parameters or distribution assumptions to fret over.
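To see what the rank conversion above actually does, here is a tiny example with made-up counts. Notice that the gap between 5 and 24 motifs collapses to a single step, which is exactly the "no assumption about scale" property.

```r
# Hypothetical motif counts, including one far-out value.
motif_count <- c(0, 2, 2, 5, 24)

rankedMC <- factor(motif_count, ordered = TRUE)   # levels: 0 < 2 < 5 < 24
levels(rankedMC) <- seq_along(levels(rankedMC))   # relabel levels as 1..4
rankedMC <- as.numeric(rankedMC)                  # integer codes per observation

print(rankedMC)
# [1] 1 2 2 3 4
```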