LIMMA and design matrix.
1
0
Entering edit mode
3.6 years ago

Hi, I'm using LIMMA for DE analysis on the METABRIC dataset. I divided the sample in four class, and I want to know which genes are deferentially expressed among each group. With RStudio I wrote this simple script:

d

data1904 <- read.table("METABRIC_1904_data_expression.tab", row.names="Gene", header = TRUE, sep = "\t")
        dim(data1904)
        groups62 <- read.table("designmatrix_62.txt", header=TRUE, row.names="SampleID", sep = "\t" )
        C4_62 <- factor(groups62$Genes62, levels=c("RED","BLU","GREEN","ORANGE")) 
        design_62 <- model.matrix(~0+C4_62)
        colnames(design_62) <- c("RED","BLU","GREEN","ORANGE")
        fit <- lmFit(data1904, design_62)
        contrast.matrix <- makeContrasts(BLU-GREEN, BLU-ORANGE, BLU-RED, GREEN-RED, GREEN-ORANGE, RED-ORANGE, levels=design_62)

        fit_62 <- contrasts.fit(fit, contrast.matrix)
        fit_62 <- eBayes(fit_62)
        results <- decideTests(fit_62)
        summary(results)



> summary(results)
BLU - GREEN BLU - ORANGE BLU - RED GREEN - RED GREEN - ORANGE RED - ORANGE
Down          3926         4327      5170        4451           3726          684
NotSig       15402        14067     11195       13873          16039        23048
Up            5040         5974      8003        6044           4603          636

Now, the file designmatrix_62.txt is sorted alphabetically with respect to he color name descending (I have one column with the Sample ID an one column with the "Cluster Color" (There are 4 Colors, BLUE GREEN RED and ORANGE, here listed first ones ).:

SampleID    Genes62
MB-0147 RED
MB-0167 RED
MB-0174 RED
MB-0238 RED
MB-0241 RED
MB-0346 RED
MB-0358 RED
MB-0371 RED
MB-0391 RED
MB-0393 RED
MB-0660 RED
MB-0882 RED
MB-0906 RED
MB-2796 RED
MB-2847 RED
MB-2922 RED
MB-3025 RED
MB-3028 RED
MB-3153 RED
MB-3470 RED
MB-3488 RED
.....

And now I sort the "color" column (Genes62) alphabetically ascending

  SampleID  Genes62
    MB-0000 BLU
    MB-0005 BLU
    MB-0006 BLU
    MB-0014 BLU
    MB-0022 BLU
    MB-0028 BLU
    MB-0039 BLU
    MB-0045 BLU
    MB-0048 BLU
    MB-0053 BLU
    MB-0054 BLU
    MB-0056 BLU
    MB-0062 BLU
    MB-0064 BLU
    MB-0068 BLU
    MB-0079 BLU
    MB-0081 BLU
    MB-0083 BLU
    MB-0093 BLU
    MB-0101 BLU
    MB-0106 BLU
    .....

and rerun the script AS IS

I get these results....

BLU - GREEN BLU - ORANGE BLU - RED GREEN - RED GREEN - ORANGE RED - ORANGE
Down          5591         6611      1176        4333           8289         6209
NotSig       13663        13054     22065       15197           9451        13383
Up            5114         4703      1127        4838           6628         4776

The only thing that change is the sorting of the table which is red at the beginning.

Sorted Descending:

> table(C4_62)
C4_62
RED    BLU  GREEN ORANGE 
 337    547    718    302

Sorted Ascending

> table(C4_62)
C4_62
RED    BLU  GREEN ORANGE 
337    547    718    302

Exactly the same numbers, but different results... It seem as LIMMA associate the samples not by the Sample ID but randomly in 4 groups with the same number of samples in each category, depending how the list of samples is sorted

This does not make any sense, does it?

Any clue?

Thank you very much.

R software error • 2.0k views
ADD COMMENT
0
Entering edit mode

Could you please rganize your question in a way that one can easily understand what is going on. There is a code option (101010 button) to highlight code and data. Right now I personally would be reluctant to even read the question because it is long and a bit unorganized. Eventually that helps getting good answers. Thanks!

ADD REPLY
0
Entering edit mode
 data1904 <- read.table("METABRIC_1904_data_expression.tab", row.names="Gene", header = TRUE, sep = "\t")
dim(data1904)

groups62 <- read.table("designmatrix_62.txt", header=TRUE, row.names="SampleID", sep = "\t" )

C4_62 <- factor(groups62$Genes62, levels=c("RED","BLU","GREEN","ORANGE")) 

design_62 <- model.matrix(~0+C4_62)

colnames(design_62) <- c("RED","BLU","GREEN","ORANGE")

fit <- lmFit(data1904, design_62)

contrast.matrix <- makeContrasts(BLU-GREEN, BLU-ORANGE, BLU-RED, GREEN-RED, GREEN-ORANGE, RED-ORANGE, levels=design_62)

fit_62 <- contrasts.fit(fit, contrast.matrix)

fit_62 <- eBayes(fit_62)

results <- decideTests(fit_62)

summary(results)

The question is: You change the sorting of the desigmatrix.txt file and you get different result running the same script.

ADD REPLY
0
Entering edit mode

Ciao Stefano, il tuo messaggio รจ incomprensibile. Puoi modificarlo e utilizzare il 101 010

ADD REPLY
0
Entering edit mode

Scusate... non avevo capito come funzionasse. any idea?

ADD REPLY
0
Entering edit mode

This is the code:

library(stats)
library(gplots)
library(limma)
library(edgeR)
library(Glimma)
library(RColorBrewer)


setwd("D:/Projects/BIOINFORMATICS/ENDOMATRIX/METABRIC")

data1904 <- read.table("METABRIC_1904_data_expression.tab", row.names="Gene", header = TRUE, sep = "\t") 
dim(data1904)

groups62 <- read.delim("designmatrix_62.txt")
dim(groups62)

C4_62 <- factor(groups62$Genes62, levels=c("RED","BLU","GREEN","ORANGE")) 
design_62 <- model.matrix(~0+C4_62, data= data1904)

colnames(design_62) <- c("RED","BLU","GREEN","ORANGE")

fit <- lmFit(data1904, design_62)
contrast.matrix <- makeContrasts(BLU-GREEN, BLU-ORANGE, BLU-RED, GREEN-RED, GREEN-ORANGE, RED-ORANGE, levels=design_62)


fit_62 <- contrasts.fit(fit, contrast.matrix)
fit_62 <- eBayes(fit_62)
results <- decideTests(fit_62) #any fc
summary(results)

My question is simple. If you change the sorting of the file "designmatrix_62.txt" which contain 4 classes for my samples (BLU; RED, GREEN and ORANGE) the result of the Summary gives different results.

I made another test... in the data file (the METABRIC dataset) the tample ID is as such: MB-0000 if in the file "designmatrix_62.txt" I replace the MB with an MC, and now the sampleID is MC-0000 the analysis work as fine, with no error.. and gives the same results...

HOW IS IT POSSIBLE? The "designmatrix_62.txt" is:

SampleID    Genes62
MC-0000    BLU
MC-0002    GREEN
MC-0005    BLU
MC-0006    BLU
MC-0008    ORANGE
MC-0010    ORANGE
MC-0014    BLU


The database ID are:
SampleID    
MB-0000
MB-0002
MB-0005
MB-0006
MB-0008
MB-0010
MB-0014

How is it possible that the function
lmFit(data1904, design_62)

can actually work???

ADD REPLY
0
Entering edit mode

Thank you. Can you show the sorted and unsorted file for designmatrix_62.txt?

In both cases, what is the output of:

dim(groups62)
str(groups62)
str(design_62)
ADD REPLY
0
Entering edit mode
> dim(groups62)
[1] 1904    2
> str(design_62)
 num [1:1904, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:1904] "1" "2" "3" "4" ...
  ..$ : chr [1:4] "RED" "BLU" "GREEN" "ORANGE"
 - attr(*, "assign")= int [1:4] 1 1 1 1
 - attr(*, "contrasts")=List of 1
  ..$ C4_62: chr "contr.treatment"
> str(groups62)
'data.frame':   1904 obs. of  2 variables:
 $ SampleID: chr  "MB-0000" "MB-0002" "MB-0005" "MB-0006" ...
 $ Genes62 : chr  "BLU" "GREEN" "BLU" "BLU" ...

Moreover today I removed the ID of the samples from the file designmatrix_62.txt.. and the result are the same (without altering the sorting).

the file is simple, two column, tab delimited, one for the SampleID and one for the category: but since the samples are 1904, it is impossible to show it here.

ADD REPLY
0
Entering edit mode

we need before and after you perform the sorting. You indicated that the 'sorting' is what is resulting in a discrepancy.

ADD REPLY
0
Entering edit mode
    Example 1:
    designmatrix_62.txt:

    SampleID    Genes62
    MB-0000 BLU
    MB-0002 GREEN
    MB-0005 BLU
    MB-0006 BLU
    MB-0008 ORANGE
    MB-0010 ORANGE
    MB-0014 BLU
    MB-0022 BLU
    MB-0028 BLU
    ...

    > groups62 <- read.delim("designmatrix_62.txt") 
    > dim(groups62)
    [1] 1904    2
    > C4_62 <- factor(groups62$Genes62, levels=c("RED","BLU","GREEN","ORANGE")) 
    > design_62 <- model.matrix(~0+C4_62, data = data1904)
    > colnames(design_62) <- c("RED","BLU","GREEN","ORANGE")
    > fit <- lmFit(data1904, design_62)
    > contrast.matrix <- makeContrasts(BLU-GREEN, BLU-ORANGE, BLU-RED, GREEN-RED, GREEN-ORANGE, RED-ORANGE, levels=design_62)
    > fit_62 <- contrasts.fit(fit, contrast.matrix)
    > fit_62 <- eBayes(fit_62)
    > results2 <- decideTests(fit_62, adjust="BH", p.value=0.05, lfc = 0.584962501) 
    > summary(results2)
           BLU - GREEN BLU - ORANGE BLU - RED GREEN - RED GREEN - ORANGE RED - ORANGE
    Down             7          512       566         173            635          880
    NotSig       24211        23339     22780       23769          23207        23050
    Up             150          517      1022         426            526          438




----------


    Example 2:
    designmatrix_62.txt:

    Genes62
    BLU
    GREEN
    BLU
    BLU
    ORANGE
    ORANGE
    BLU
    BLU
    BLU
    ...

    > groups62 <- read.delim("designmatrix_62.txt") 
    > dim(groups62)
    [1] 1904    1
    > C4_62 <- factor(groups62$Genes62, levels=c("RED","BLU","GREEN","ORANGE")) 
    > design_62 <- model.matrix(~0+C4_62, data = data1904)
    > colnames(design_62) <- c("RED","BLU","GREEN","ORANGE")
    > fit <- lmFit(data1904, design_62)
    > contrast.matrix <- makeContrasts(BLU-GREEN, BLU-ORANGE, BLU-RED, GREEN-RED, GREEN-ORANGE, RED-ORANGE, levels=design_62)
    > fit_62 <- contrasts.fit(fit, contrast.matrix)
    > fit_62 <- eBayes(fit_62)
    > results2 <- decideTests(fit_62, adjust="BH", p.value=0.05, lfc = 0.584962501) 
    > summary(results2)
           BLU - GREEN BLU - ORANGE BLU - RED GREEN - RED GREEN - ORANGE RED - ORANGE
    Down             7          512       566         173            635          880
    NotSig       24211        23339     22780       23769          23207        23050
    Up             150          517      1022         426            526          438
ADD REPLY
0
Entering edit mode

As you see the model.matrix (design_62) does not care about SampleID but just the order in winch the parameters are presented/read.

ADD REPLY
0
Entering edit mode

It does not have to store it.

Before we go crazy here, can you please re-post a minimal example of data that permits that we can easily re-produce the behaviour that you observe?

ADD REPLY
0
Entering edit mode

How is it possible that the function

lmFit(data1904, design_62)

can actually work???

The check, if any by limma, will be made by row and column names.

Try:

colnames(data1904)
ADD REPLY
0
Entering edit mode
> colnames(data1904)
   [1] "MB.0000" "MB.0002" "MB.0005" "MB.0006" "MB.0008" "MB.0010" "MB.0014" "MB.0022" "MB.0028"
  [10] "MB.0035" "MB.0036" "MB.0039" "MB.0045" "MB.0046" "MB.0048" "MB.0050" "MB.0053" "MB.0054"
  [19] "MB.0056" "MB.0060" "MB.0062" "MB.0064" "MB.0066" "MB.0068" "MB.0079" "MB.0081" "MB.0083"
  [28] "MB.0093" "MB.0095" "MB.0097" "MB.0100" "MB.0101" "MB.0102" "MB.0106" "MB.0107" "MB.0108"
  [37] "MB.0109" "MB.0111" "MB.0112" "MB.0113" "MB.0114" "MB.0115" "MB.0116" "MB.0117" "MB.0119"
  [46] "MB.0120" "MB.0121" "MB.0122" "MB.0123" "MB.0124" "MB.0125" "MB.0126" "MB.0127" "MB.0128"
  [55] "MB.0129" "MB.0130" "MB.0131" "MB.0133" "MB.0134" "MB.0135" "MB.0136" "MB.0138" "MB.0139"
  [64] "MB.0140" "MB.0142" "MB.0143" "MB.0144" "MB.0145" "MB.0146" "MB.0147" "MB.0148" "MB.0149"
  [73] "MB.0150" "MB.0151" "MB.0152" "MB.0154" "MB.0155" "MB.0157" "MB.0162" "MB.0163" "MB.0164"
  [82] "MB.0165" "MB.0166" "MB.0167" "MB.0168" "MB.0170" "MB.0171" "MB.0172" "MB.0173" "MB.0174"
  [91] "MB.0175" "MB.0176" "MB.0177" "MB.0178" "MB.0179" "MB.0180" "MB.0181" "MB.0184" "MB.0188"
 [100] "MB.0189" "MB.0191" "MB.0192" "MB.0193" "MB.0194" "MB.0195" "MB.0197" "MB.0198" "MB.0199"
 [109] "MB.0200" "MB.0201" "MB.0202" "MB.0203" "MB.0204" "MB.0205" "MB.0206" "MB.0207" "MB.0209"
 [118] "MB.0211" "MB.0214" "MB.0215" "MB.0218" "MB.0220" "MB.0221" "MB.0222" "MB.0223" "MB.0224"
 [127] "MB.0225" "MB.0226" "MB.0227" "MB.0228" "MB.0229" "MB.0230" "MB.0231" "MB.0232" "MB.0233"
 [136] "MB.0234" "MB.0235" "MB.0236" "MB.0238" "MB.0239" "MB.0241" "MB.0242" "MB.0243" "MB.0244"

[145] "MB.0

ADD REPLY
1
Entering edit mode
3.6 years ago

Your assumption seems to be that limma will automatically keep your design table and expression data in synchronicity. In fact, you, as the analyst, must ensure this.

What is happening is that, when you re-order your design table, you need to always also re-order your expression data (specifically, the columns).

To clarify: The length and order of the rows in your design must be 'perfectly' equal to the column names of your expression data.

In your case, this following command should return TRUE (if not, there is a problem):

all(groups62$SampleID == colnames(data1904))

Kevin

ADD COMMENT
0
Entering edit mode

Well... now is clear to me... The order of the samples and the order of the SampleID in the design matrix MUST match. So you do not need the SampleID at all, or better LIMMA do not care about the SampleID but just the order of the data. OK.

Yes, in my opinion is an obvious assumption that if a program need a list with some parameters linked to the sample name, this information is used and retained.

I use routinely EdgeR for NGS analysis and the procedure is the same.. but if you supply a list of sample with the classification, that information is stored and link to the expression data in a dataframe. With LIMMA this is not the case. This is something that should be mention in the user guide, though...

Anyway.. thank you for your kind words and patient.

Stefano

ADD REPLY
1
Entering edit mode

Prego! prego!

ADD REPLY

Login before adding your answer.

Traffic: 3277 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6