[R] Microarray analysis interpreting logFC after makeContrasts
0
0
Entering edit mode
7.9 years ago
mheiser1 ▴ 10

Soo this question might sound stupid, but I have some trouble understanding how to interpret my logFC values.

Let's assume I have my normalized data from the affy chip (eset), a design matrix and want to identify differentially expressed genes between a Treated and a Control group:

contrastm <- makeContrasts(Treated - Control, levels = design)
fit <- lmFit(exprs(eset), design)
fit2 <- contrasts.fit(fit, contrastm)
fit2b <- eBayes(fit2)


Then I call topTable(fit2b, coef = 1) and get a logFC value of -9.8 for a gene. This means my gene is 2^(-9.8) fold downregulated in the Treated group, right?

So of course if I change the command makeContrasts to Control - Treated I get a logFC of 9.8, which is the value described in the publication I am working with... so which one is right?

Edit: Here is the current problem...:

Sorry for the delay, had a couple other things in between. So I ran the following script at home..It will download the GEOdataset (~115MB), untar it and create a annotation.txt in a new folder:

# libraries needed
library(Biobase)
library(GEOquery)
library(limma)
library(simpleaffy)

dir.create("BioStars_script")
setwd("BioStars_script/")
### This might take a while...
geoset <- getGEOSuppFiles("GSE28914")
### untar files in your current workfolder
untar("GSE28914/GSE28914_RAW.tar")

### Creates an Annotation.txt for read.affy

### Sample infos according to the GEO profile, cel files should be ordered in increasing number
Annotation <- data.frame(File = Annotation_dummy$Name[-1], Name = Annotation_dummy$Name[-1],
Target = c("intact", "acute", "POD3", "intact", "POD3",
"intact", "acute", "POD3", "intact", "acute",
"POD7", "intact", "acute", "POD7", "intact",
"POD3", "POD7", "intact", "acute", "POD3",
"POD7", "intact", "acute", "POD3", "POD7"))
write.table(Annotation, "Annotation.txt", quote = F, row.names = F, sep = "\t")

### Begin import
### RMA normalization
rma.set <- rma(wound_data)

### Filter probes, remove duplicates, etc
e.anig <- nsFilter(rma.set, require.entrez = T, remove.dupEntrez = T)

### Create Model
f <- factor(as.factor(rma.set$Target), levels = levels(as.factor(rma.set$Target)))
design <- model.matrix(~0 + f)
colnames(design) <- levels(as.factor(c("acute", "intact", "POD3", "POD7")))

#### define the comparisons that you would like to do
contrast <- makeContrasts(d3 = POD3 - intact,
d7 = POD7 - intact,
d0 = acute - intact,
levels = design)

#### fit a linear model for each gene over all microarrays
fit <- lmFit(exprs(e.anig$eset), design) ### compute coefficients and standard errors for the defined contrasts fit2 <- contrasts.fit(fit, contrast) ### compute statistics for differential gene expression (moderated t-test, and others) fit2 <- eBayes(fit2) ### Top differentially expressed genes, 204475_at is MMP1 for example topTable(fit2)  Now what really confuses me: I get the correct results as described in the paper! My work machine, running the same script however spits out different results... The only differences were: • my work machine is blocked from using GEOquery, so I had to download the GEOdataset manually (checked this at home) • the order of the colnames(design) at home is different than at work, so I adjusted the script accordingly This is my sessionInfo() at home: R version 3.1.2 (2014-10-31) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] hgu133plus2.db_2.14.0 org.Hs.eg.db_2.14.0 RSQLite_1.0.0 DBI_0.3.1 AnnotationDbi_1.26.1 [6] GenomeInfoDb_1.0.2 hgu133plus2cdf_2.14.0 simpleaffy_2.40.0 gcrma_2.36.0 genefilter_1.46.1 [11] affy_1.42.3 limma_3.20.9 GEOquery_2.30.1 Biobase_2.24.0 BiocGenerics_0.10.0 loaded via a namespace (and not attached): [1] affyio_1.32.0 annotate_1.42.1 BiocInstaller_1.14.3 Biostrings_2.32.1 IRanges_1.22.10 [6] preprocessCore_1.26.1 RCurl_1.95-4.3 splines_3.1.2 stats4_3.1.2 survival_2.37-7 [11] tools_3.1.2 XML_3.98-1.1 xtable_1.7-4 XVector_0.4.0 zlibbioc_1.10.0  At work I also use R 3.1.2 with all packages up to date, but a Windows-32bit machine. (I just checked earlier today and reran the script after updating...) Sessioninfo will follow in the morning. Anyone got a clue what could be wrong? microarray R • 6.0k views ADD COMMENT 1 Entering edit mode You seem to have a typo, there is no fit2.m object. Do you mean fit2b? ADD REPLY 0 Entering edit mode you're right, didn't change the results though. ADD REPLY 1 Entering edit mode The data comes from this paper: http://onlinelibrary.wiley.com/doi/10.1111/j.1524-475X.2012.00831.x/abstract and has the GEO-Accession number: GSE28914 The examplatory Gene which I am using as a "benchmark" is MMP1, which is upregulated ~700fold ond day3 and ~500fold on day7 (Table 3) This is the approach I used, I left the quality control part out: - wound_data is the Data from the celfiles imported with read.affy, since i can't use the GEOquery package atm, I will add the import from GEO later so everyone can run it library(affy) library(affyPLM) library(affycoretools) library(limma) ### RMA normalization rma.set <- rma(wound_data) ### Get the expression set e <- exprs(rma.set) ### Filter probes e.anig <- nsFilter(rma.set, require.entrez = T, remove.dupEntrez = T) ### Create Model f <- factor(as.factor(rma.set$Target),
levels = levels(as.factor(rma.set$Target))) design <- model.matrix(~0 + f) colnames(design) <- levels(as.factor(c("POD3", "POD7", "Acute", "intact"))) #### define the comparisons that you would like to do contrast <- makeContrasts(d3 = POD3 - intact, d7 = POD7 - intact, d0 = Acute - intact, levels = design) #### fit a linear model for each gene over all microarrays fit <- lmFit(exprs(e.anig$eset), design)
### compute coefficients and standard errors for the defined contrasts
fit2 <- contrasts.fit(fit, contrast)
### compute statistics for differential gene expression (moderated t-test, and others)
fit2 <- eBayes(fit2)
### Top differentially expressed genes
topTable(fit2)

MMP1 is the gene with the ProbeID 204475_at and I get a logFC of -8.75948 -9.220335 0.6801707 for d3, d7 and d0 respectively.

1
Entering edit mode

mheiser1 your code looks ok overall. If you add the import, it will help to run it.

0
Entering edit mode

thank you for checking, I updated the original post

0
Entering edit mode

How many probes are targeting MMP1 gene?

I did not get your question, what exactly is the confusion?

1
Entering edit mode

The paper states that MMP1 is upregulated by many fold in treatment vs. the control. This is not what the OP observed (MMP1 was found to be downregulated in treatment) when he/she ran the above code.

1
Entering edit mode

I just saw the method

Their results are expressed as means ± SEM. Statistical analysis was performed, using GeneSpring GX software.

But still results should not go another way around.

What I can say is, MMP1 is highly expressed in epithelial cells (RPM ~ 100-300) including skin, fibroblast or HUVEC cells.

Codes looks okay. If this is the only Probe mapped to MMP1 then OP should probably contact the authors of the paper.

0
Entering edit mode

Yes, I saw they were using a different normalizing approach (normalize over median), but this can not explain me getting opposite results. Hence my confusion.

MMP1 isn't the only one which gives unexpected results. If I "twist" the contrasts to get their numbers, Interleukin-1b for example is down regulated (among other inflammatory markers like IL-6), which is highly unlikely...

Guess I'll write them an email, see if they can shed some light on that, thanks for everyones input! I'll share their answer.

0
Entering edit mode

Did they take patient into account? I imagine that that could drastically change things. BTW, you can drop the contrast if you just make intact the base level of f.

0
Entering edit mode

When you use Treated-Control, it should be FC = -891.4438 and for Control-Treated, FC = 891.4438. It is basically the same thing, only the direction changes. But you should always use Treated-Control which tells you if it is upregulated/downregulated in treatment with respect to the control. Isn't the aim always to find what changes happen in abnormal tissue versus normal tissue? So, in your case it is downregulated in your treatment samples by a FC of 891.4438.

0
Entering edit mode

I don't think the fold change can be negative...more like closer to zero. So I guess you agree with me that it should be downregulated according to my results? I'm going to post the full code (the dataset is in the GEO database) later, I kind of hope I made a silly mistake somewhere and not the authors of the paper...

0
Entering edit mode

What I meant to say is the last line of my comment. That particular gene is downregulated by a fold change of 891.44 in treatment samples as compared to the normals.

0
Entering edit mode

Sorry for the delay, had a couple other things in between. So I ran the following script at home..It will download the GEOdataset (~115MB), untar it and create a annotation.txt in a new folder:

# libraries needed
library(Biobase)
library(GEOquery)
library(limma)
library(simpleaffy)

dir.create("BioStars_script")
setwd("BioStars_script/")
### This might take a while...
geoset <- getGEOSuppFiles("GSE28914")
### untar files in your current workfolder
untar("GSE28914/GSE28914_RAW.tar")

### Creates an Annotation.txt for read.affy

### Sample infos according to the GEO profile, cel files should be ordered in increasing number
Annotation <- data.frame(File = Annotation_dummy$Name[-1], Name = Annotation_dummy$Name[-1],
Target = c("intact", "acute", "POD3", "intact", "POD3",
"intact", "acute", "POD3", "intact", "acute",
"POD7", "intact", "acute", "POD7", "intact",
"POD3", "POD7", "intact", "acute", "POD3",
"POD7", "intact", "acute", "POD3", "POD7"))
write.table(Annotation, "Annotation.txt", quote = F, row.names = F, sep = "\t")

### Begin import
### RMA normalization
rma.set <- rma(wound_data)

### Filter probes, remove duplicates, etc
e.anig <- nsFilter(rma.set, require.entrez = T, remove.dupEntrez = T)

### Create Model
f <- factor(as.factor(rma.set$Target), levels = levels(as.factor(rma.set$Target)))
design <- model.matrix(~0 + f)
colnames(design) <- levels(as.factor(c("acute", "intact", "POD3", "POD7")))

#### define the comparisons that you would like to do
contrast <- makeContrasts(d3 = POD3 - intact,
d7 = POD7 - intact,
d0 = acute - intact,
levels = design)

#### fit a linear model for each gene over all microarrays
fit <- lmFit(exprs(e.anig\$eset), design)
### compute coefficients and standard errors for the defined contrasts
fit2 <- contrasts.fit(fit, contrast)
### compute statistics for differential gene expression (moderated t-test, and others)
fit2 <- eBayes(fit2)
### Top differentially expressed genes, 204475_at is MMP1 for example
topTable(fit2)


Now what really confuses me: I get the correct results as described in the paper! My work machine, running the same script however spits out different results... The only differences were:

• my work machine is blocked from using GEOquery, so I had to download the GEOdataset manually (checked this at home)

• the order of the colnames(design) at home is different than at work, so I adjusted the script accordingly

This is my sessionInfo at home:

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu133plus2.db_2.14.0 org.Hs.eg.db_2.14.0   RSQLite_1.0.0         DBI_0.3.1             AnnotationDbi_1.26.1
[6] GenomeInfoDb_1.0.2    hgu133plus2cdf_2.14.0 simpleaffy_2.40.0     gcrma_2.36.0          genefilter_1.46.1
[11] affy_1.42.3           limma_3.20.9          GEOquery_2.30.1       Biobase_2.24.0        BiocGenerics_0.10.0

loaded via a namespace (and not attached):
[1] affyio_1.32.0         annotate_1.42.1       BiocInstaller_1.14.3  Biostrings_2.32.1     IRanges_1.22.10
[6] preprocessCore_1.26.1 RCurl_1.95-4.3        splines_3.1.2         stats4_3.1.2          survival_2.37-7
[11] tools_3.1.2           XML_3.98-1.1          xtable_1.7-4          XVector_0.4.0         zlibbioc_1.10.0


At work I also use R 3.1.2 with all packages up to date, but a Windows-32bit machine. (I just checked earlier today and reran the script after updating...) Sessioninfo will follow in the morning.

Anyone got a clue what could be wrong?

1
Entering edit mode

The good thing is at least now your results conform with those of the authors. Even with different versions of R packages or a different OS, you shouldn't be getting exactly opposite results, right? I am pretty sure something must be off with the code you ran previously.

I ran your script and got these results (Sure enough you are getting the same results):

                     ID        d3        d7         d0   AveExpr        F      P.Value
204475_at     204475_at  9.900506  9.220335  0.4608554  7.712568 176.7406 2.002251e-16
204475_at   2.886674e-13


And by the way I am using a very old version of R (a lot of my packages are not updated as well) on my personal laptop:

> sessionInfo()

R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu133plus2.db_2.8.0  org.Hs.eg.db_2.8.0    RSQLite_0.11.2
[4] DBI_0.2-5             hgu133plus2cdf_2.11.0 AnnotationDbi_1.20.7
[7] simpleaffy_2.34.0     gcrma_2.30.0          genefilter_1.40.0
[10] affy_1.36.1           GEOquery_2.24.1       BiocInstaller_1.8.3
[13] limma_3.14.4          Biobase_2.18.0        BiocGenerics_0.4.0

loaded via a namespace (and not attached):
[1] affyio_1.26.0         annotate_1.36.0       Biostrings_2.26.3
[4] IRanges_1.16.6        parallel_2.15.2       preprocessCore_1.20.0
[7] RCurl_1.95-4.1        splines_2.15.2        stats4_2.15.2
[10] survival_2.37-7       tools_2.15.2          XML_3.96-1.1
[13] xtable_1.7-1          zlibbioc_1.4.0

0
Entering edit mode

Okay, thank you for checking it on your machine again. I again checked that I named the columns of the design matrix correctly...

I just now decided to check my annotation file again and saw that I named the POD3 and POD7 samples 3POD and 7POD. If I change that in the Annotation file, everything works! I guess the number in the string somehow messes up the whole order of the factor?

1
Entering edit mode

Maybe your column names are reversed, please check it to make sure

0
Entering edit mode

The name in my Annotation file was (under Target) 3POD/7POD instead of POD3/POD7. After I changed that I get the correct results.

0
Entering edit mode
Congratulations, finally you got it