I am doing some statistical analysis of RNASeq data and I have run into a bit of a problem. Instead of looking at comparison of individual genes I want to compare groups of genes. The setup is as follows: I have two independent groups of genes. For each gene I have three measures from three biological replicates. I want to compare expression between groups. Sample table with design: https://ibb.co/myCssd
Right now I'm averaging biological replicates and performing Mann-Whitney U-test, but I was wondering if there is a better solution? The other problem is that I can not really compare absolute expression values like FPKM, so I'm looking at % maximum expression instead. This works nicely on averages, but would not really work on individual replicates due to outliers...
Any suggestions would be greatly appreciated.
A bit of background: I have 5 time points, 3 replicates each. I know that between time point 2 and 3 there is a transcription factor which is expressed. I want to see the effect it has on the target genes, compared to the control genes at time point 3. I essentially want to see if there is a difference in expression between target (bound by transcription factor) and control (not bound by the transcription factor) genes at time point 3. At this point I calculate the mean of three replicates for all the genes. Split the genes into target and control sets and then use Mann-Whitney U test to see if there is a difference in underlying distributions. Also, because the absolute values are difficult to compare I scale them to % of maximum expression. So, I’m essentially comparing the difference in distributions of % of maximum expression at point 3. One of my problems is that instead of incorporating the replicates I’m just using the mean. I was wondering if there is a cleaner way performing analysis like that?
As I understand, you want to compare the combined effect of more than one gene at the same time? So, it would be something like:
Gene1+Gene2+Gene3 in GroupA Versus Gene1+Gene2+Gene3 in GroupB ?
...or are you literally just looking for a singl p-value for each gene?
If you could at all possible obtain raw counts, that would be beneficial.
Yes. Option one. But the groups are independent so it's group1: gene1+gene2+gene3 vs group2: gene4+gene5+gene6. I essentially want to see if the expression in group2 is lower/higher than group1 at a given point. I have added some background information. Hopefully that could help?
Hello, thanks for adding the background - seems very interesting. The experiment would obviously be better performed in a 'robust' DE analysis tool, like EdgeR, DESeq2, etc. However, I understand that you may not have the raw counts such that you could utiliise these tools.
I'm not sure if it's exactly what you could use but I recently posted about the application of the Wald Test to a fitted regression model: A: Differential Expression analysis using Wald-Test
In such a case, you'd do:
This would derive a single p-value for
gene1 + gene2 + gene3
(cofficients / terms 2 to 4) in relation to their ability to distinguish TF binding as encoded in the TFbinding categorical variable. If TFbinding is a continous representation of TF binding, then you should switch the model to a linear regressionlm()
Not sure if that helps at all.