Hello all, I am having trouble working out the correct way to analyze my Affymetrix dataset. I have a count table of 20,864 genes expressed as log2 CPM across 2,964 samples from 12 different groups with UNEQUAL numbers of samples per group (from 28 to 550). I want to identify DEGs (differentially expressed genes) between ONE group and the rest of the samples (I am not interested in DEGs between group 1 and, say, group 4 or 10, just those between group 1 and all the other groups combined). If I compare the mean gene expression of my group of interest (65 samples) with the mean of all the other samples, the groups with more samples are going to 'weigh' more in that mean than the poorly represented groups, and that would bias my result, right (same for a t-test)? So I calculated the mean for each group and then took the mean of those means. But then I am left with only two columns (mean of my group vs. mean of the rest), which is not suitable for a Wilcoxon test or a t-test. So my question (finally) is: how do I analyze this dataset, taking into account the different numbers of samples per group, without going through the means? MANOVA? limma? Something else? Thank you so much for any suggestions, and my apologies if this is trivial. P.S. I don't have access to the CEL files or raw data.
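To make the weighting concern in the question concrete, here is a minimal Python sketch for a single gene. The group names and expression values are made up purely for illustration; only the group sizes (550 and 28) echo the range mentioned above.

```python
from statistics import mean

# Hypothetical log2 expression values for ONE gene in three "rest" groups
# of very different sizes (made-up numbers, not from the real dataset).
rest_groups = {
    "group_2": [5.0] * 550,  # largest group
    "group_3": [7.0] * 100,
    "group_4": [9.0] * 28,   # smallest group
}

# Pooled mean: every sample counts once, so the largest group dominates.
pooled = mean(v for values in rest_groups.values() for v in values)

# Mean of group means: every group counts once, regardless of its size.
mean_of_means = mean(mean(values) for values in rest_groups.values())

print(f"pooled mean         = {pooled:.2f}")
print(f"mean of group means = {mean_of_means:.2f}")
```

The pooled mean is pulled toward the 550-sample group, while the mean of group means treats all three groups equally. The latter removes the size weighting but, as noted in the question, leaves a single summary value per side and so nothing to run a test on.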
Could you break it into two steps and incorporate random sampling, as follows? Assuming the goal is to identify genes that differ between each group and the whole data set, you could create representative data sets by downsampling all larger groups to 28 observations per gene (i.e. if a group has 150 samples, for each gene randomly select 28 of the 150 available values for that gene). Then run your analysis to see which genes differ from the whole, by group. Repeat this a few times to see how stable the resulting gene sets are. You could also evaluate the variance of those genes in each complete data set to see if anything funky is happening. This would give you a sense of the interesting genes. (FWIW, I'm an experimentalist, not a statistician, so this is just an idea, not a robust method.) Then again, I'm not sure any of this is really necessary... can't you just use limma? It doesn't care if the replicate numbers differ and will weight things appropriately, IIRC.
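The downsampling idea could be sketched roughly as below, for a single gene and in plain Python. Everything here is an assumption for illustration: the helper names `welch_t` and `downsampled_t`, the simulated expression values, and the choice of Welch's t statistic as the comparison (the reply does not name a test). It is not a validated method, just one way to read the suggestion.

```python
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples (no p-value is computed)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5

def downsampled_t(target, rest_groups, n_per_group, n_repeats=20, seed=0):
    """For one gene: repeatedly downsample every group (target included)
    to at most n_per_group values, pool the 'rest' subsamples, and
    compare them with the target subsample."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_repeats):
        sub_target = rng.sample(target, min(n_per_group, len(target)))
        rest = []
        for values in rest_groups.values():
            rest.extend(rng.sample(values, min(n_per_group, len(values))))
        stats.append(welch_t(sub_target, rest))
    return stats

# Simulated log2 values for one gene (hypothetical, for illustration only).
data_rng = random.Random(1)
target = [data_rng.gauss(8.0, 1.0) for _ in range(65)]
rest_groups = {
    "group_2": [data_rng.gauss(5.0, 1.0) for _ in range(550)],
    "group_3": [data_rng.gauss(5.5, 1.0) for _ in range(28)],
}

t_values = downsampled_t(target, rest_groups, n_per_group=28, n_repeats=10)
print(f"t statistic range over repeats: {min(t_values):.2f} to {max(t_values):.2f}")
```

If the statistic for a gene stays large and keeps the same sign across repeats, that gene is a reasonably stable candidate; if it swings widely, the apparent difference may depend on which samples happened to be drawn. For a real analysis you would loop over all genes and probably draw the same samples for every gene within a repeat.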