Hi, I'm using limma for DE analysis on the METABRIC dataset. I divided the samples into four classes, and I want to know which genes are differentially expressed between each pair of groups. In RStudio I wrote this simple script:
library(limma)
data1904 <- read.table("METABRIC_1904_data_expression.tab", row.names="Gene", header = TRUE, sep = "\t")
dim(data1904)
groups62 <- read.table("designmatrix_62.txt", header=TRUE, row.names="SampleID", sep = "\t" )
C4_62 <- factor(groups62$Genes62, levels=c("RED","BLU","GREEN","ORANGE"))
design_62 <- model.matrix(~0+C4_62)
colnames(design_62) <- c("RED","BLU","GREEN","ORANGE")
fit <- lmFit(data1904, design_62)
contrast.matrix <- makeContrasts(BLU-GREEN, BLU-ORANGE, BLU-RED, GREEN-RED, GREEN-ORANGE, RED-ORANGE, levels=design_62)
fit_62 <- contrasts.fit(fit, contrast.matrix)
fit_62 <- eBayes(fit_62)
results <- decideTests(fit_62)
summary(results)
> summary(results)
       BLU - GREEN BLU - ORANGE BLU - RED GREEN - RED GREEN - ORANGE RED - ORANGE
Down          3926         4327      5170        4451           3726          684
NotSig       15402        14067     11195       13873          16039        23048
Up            5040         5974      8003        6044           4603          636
Now, the file designmatrix_62.txt is sorted alphabetically by color name, descending. It has one column with the sample ID and one column with the "cluster color" (there are 4 colors: BLU, GREEN, RED and ORANGE). The first rows are:
SampleID Genes62
MB-0147 RED
MB-0167 RED
MB-0174 RED
MB-0238 RED
MB-0241 RED
MB-0346 RED
MB-0358 RED
MB-0371 RED
MB-0391 RED
MB-0393 RED
MB-0660 RED
MB-0882 RED
MB-0906 RED
MB-2796 RED
MB-2847 RED
MB-2922 RED
MB-3025 RED
MB-3028 RED
MB-3153 RED
MB-3470 RED
MB-3488 RED
.....
If I now sort the "color" column (Genes62) alphabetically ascending:
SampleID Genes62
MB-0000 BLU
MB-0005 BLU
MB-0006 BLU
MB-0014 BLU
MB-0022 BLU
MB-0028 BLU
MB-0039 BLU
MB-0045 BLU
MB-0048 BLU
MB-0053 BLU
MB-0054 BLU
MB-0056 BLU
MB-0062 BLU
MB-0064 BLU
MB-0068 BLU
MB-0079 BLU
MB-0081 BLU
MB-0083 BLU
MB-0093 BLU
MB-0101 BLU
MB-0106 BLU
.....
and rerun the script as is, I get these results:
       BLU - GREEN BLU - ORANGE BLU - RED GREEN - RED GREEN - ORANGE RED - ORANGE
Down          5591         6611      1176        4333           8289         6209
NotSig       13663        13054     22065       15197           9451        13383
Up            5114         4703      1127        4838           6628         4776
The only thing that changes is the sorting of the table, which has RED at the beginning in the first case.
Sorted descending:
> table(C4_62)
C4_62
   RED    BLU  GREEN ORANGE
   337    547    718    302
Sorted ascending:
> table(C4_62)
C4_62
   RED    BLU  GREEN ORANGE
   337    547    718    302
Exactly the same numbers, but different results... It seems as if limma associates the samples with the groups not by sample ID but seemingly at random, into 4 groups with the same number of samples per category, depending on how the list of samples is sorted.
This does not make any sense, does it?
Any clue?
Thank you very much.
Could you please organize your question in a way that one can easily understand what is going on? There is a code option (the 101010 button) to highlight code and data. Right now I personally would be reluctant to even read the question because it is long and a bit unorganized. Organizing it eventually helps in getting good answers. Thanks!
The question is: you change the sorting of the designmatrix_62.txt file and you get different results running the same script.
Hi Stefano, your message is impossible to follow. Can you edit it and use the 101010 button?
Sorry... I had not understood how it worked. Any idea?
This is the code:
My question is simple. If you change the sorting of the file "designmatrix_62.txt", which contains the 4 classes for my samples (BLU, RED, GREEN and ORANGE), summary() gives different results.
I made another test. In the data file (the METABRIC dataset) the sample IDs look like MB-0000. If I replace the MB with MC in the file "designmatrix_62.txt", so that the sample ID becomes MC-0000, the analysis still runs fine, with no error, and gives the same results.
HOW IS IT POSSIBLE? The "designmatrix_62.txt" is:
How is it possible that the function lmFit(data1904, design_62) can actually work???
Thank you.
Can you show the sorted and unsorted file for designmatrix_62.txt?
In both cases, what is the output of:
Moreover, today I removed the sample IDs from the file designmatrix_62.txt, and the results are the same (without altering the sorting).
The file is simple: two columns, tab delimited, one for the SampleID and one for the category. But since there are 1904 samples, it is impossible to show it here.
We need the file before and after you perform the sorting. You indicated that the 'sorting' is what is resulting in a discrepancy.
As you see, the model.matrix (design_62) does not care about SampleID, only about the order in which the group labels are presented/read.
It does not have to store it.
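To make the point about ordering concrete, here is a minimal sketch in base R. The four sample IDs and labels are a toy subset invented for illustration, and make_design() is a hypothetical helper, not part of limma. It builds the design from the labels in two different row orders:

```r
# Toy group table (hypothetical subset): sample IDs as row names, one label column
groups <- data.frame(Genes62 = c("RED", "RED", "BLU", "BLU"),
                     row.names = c("MB-0147", "MB-0167", "MB-0000", "MB-0005"))

# The same table, re-sorted by colour
groups_sorted <- groups[order(groups$Genes62), , drop = FALSE]

lv <- c("RED", "BLU")

# Hypothetical helper: build a no-intercept design from a vector of labels
make_design <- function(labels) {
  f <- factor(labels, levels = lv)
  d <- model.matrix(~ 0 + f)
  colnames(d) <- lv
  d
}

design_a <- make_design(groups$Genes62)
design_b <- make_design(groups_sorted$Genes62)

# The design rows are just numbered 1..n: the sample IDs are not carried over
print(rownames(design_a))

# Group sizes agree, but individual rows are assigned to different groups
print(colSums(design_a))
print(colSums(design_b))
print(all(design_a == design_b))
```

Since the design built this way only knows row positions, row i of the design is paired with column i of the expression matrix, whatever the IDs in the text file say. That would also be consistent with the MB/MC renaming making no difference.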
Before we go crazy here, can you please re-post a minimal example of data that lets us easily reproduce the behaviour that you observe?
The check, if limma makes any, will be done on row and column names.
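Rather than relying on any check inside limma, one option is to align the group table to the expression columns yourself before building the design. The sketch below uses invented toy data (expr and groups are hypothetical stand-ins for data1904 and groups62). It also assumes the expression file was read with read.table's default check.names = TRUE, which passes the header through make.names() and turns "MB-0000" into "MB.0000":

```r
# Toy expression matrix (hypothetical values): genes in rows, samples in columns
expr <- matrix(1:8, nrow = 2,
               dimnames = list(c("geneA", "geneB"),
                               c("MB.0000", "MB.0005", "MB.0147", "MB.0167")))

# Toy group table in a different sample order, IDs still containing "-"
groups <- data.frame(Genes62 = c("RED", "RED", "BLU", "BLU"),
                     row.names = c("MB-0147", "MB-0167", "MB-0000", "MB-0005"))

# Apply the same name mangling that read.table applied to the header
ids <- make.names(rownames(groups))

# Position of each expression column in the group table
idx <- match(colnames(expr), ids)
stopifnot(!anyNA(idx))   # every expression column must have a group label

# Reorder the group table so that row i describes column i of expr
groups_aligned <- groups[idx, , drop = FALSE]
stopifnot(identical(make.names(rownames(groups_aligned)), colnames(expr)))

print(as.character(groups_aligned$Genes62))
```

With the table aligned like this, the factor and model.matrix() calls from the original script can be built from groups_aligned, and the result should no longer depend on how the text file happens to be sorted. Reading the expression file with check.names = FALSE would be an alternative way to keep the IDs comparable.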
Try:
[145] "MB.0