Soo this question might sound stupid, but I have some trouble understanding how to interpret my logFC values.
Let's assume I have my normalized data from the affy chip (eset), a design matrix and want to identify differentially expressed genes between a Treated and a Control group:
contrastm <- makeContrasts(Treated - Control, levels = design)
fit <- lmFit(exprs(eset), design)
fit2 <- contrasts.fit(fit, contrastm)
fit2b <- eBayes(fit2)
Then I call topTable(fit2b, coef = 1)
and get a logFC value of -9.8 for a gene. This means my gene is 2^(-9.8)
fold downregulated in the Treated group, right?
So of course if I change the command makeContrasts to Control - Treated I get a logFC of 9.8, which is the value described in the publication I am working with... so which one is right?
Edit: Here is the current problem...:
Sorry for the delay, had a couple other things in between. So I ran the following script at home..It will download the GEOdataset (~115MB), untar it and create a annotation.txt in a new folder:
# libraries needed
library(Biobase)
library(GEOquery)
library(limma)
library(simpleaffy)
dir.create("BioStars_script")
setwd("BioStars_script/")
### This might take a while...
geoset <- getGEOSuppFiles("GSE28914")
### untar files in your current workfolder
untar("GSE28914/GSE28914_RAW.tar")
### Creates an Annotation.txt for read.affy
Annotation_dummy <- read.delim("GSE28914/filelist.txt")
### Sample infos according to the GEO profile, cel files should be ordered in increasing number
Annotation <- data.frame(File = Annotation_dummy$Name[-1],
Name = Annotation_dummy$Name[-1],
Target = c("intact", "acute", "POD3", "intact", "POD3",
"intact", "acute", "POD3", "intact", "acute",
"POD7", "intact", "acute", "POD7", "intact",
"POD3", "POD7", "intact", "acute", "POD3",
"POD7", "intact", "acute", "POD3", "POD7"))
write.table(Annotation, "Annotation.txt", quote = F, row.names = F, sep = "\t")
### Begin import
wound_data <- read.affy(covdesc = "Annotation.txt")
### RMA normalization
rma.set <- rma(wound_data)
### Filter probes, remove duplicates, etc
e.anig <- nsFilter(rma.set, require.entrez = T, remove.dupEntrez = T)
### Create Model
f <- factor(as.factor(rma.set$Target),
levels = levels(as.factor(rma.set$Target)))
design <- model.matrix(~0 + f)
colnames(design) <- levels(as.factor(c("acute", "intact", "POD3", "POD7")))
#### define the comparisons that you would like to do
contrast <- makeContrasts(d3 = POD3 - intact,
d7 = POD7 - intact,
d0 = acute - intact,
levels = design)
#### fit a linear model for each gene over all microarrays
fit <- lmFit(exprs(e.anig$eset), design)
### compute coefficients and standard errors for the defined contrasts
fit2 <- contrasts.fit(fit, contrast)
### compute statistics for differential gene expression (moderated t-test, and others)
fit2 <- eBayes(fit2)
### Top differentially expressed genes, 204475_at is MMP1 for example
topTable(fit2)
Now what really confuses me: I get the correct results as described in the paper! My work machine, running the same script however spits out different results... The only differences were:
- my work machine is blocked from using GEOquery, so I had to download the GEOdataset manually (checked this at home)
- the order of the
colnames(design)
at home is different than at work, so I adjusted the script accordingly
This is my sessionInfo()
at home:
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] hgu133plus2.db_2.14.0 org.Hs.eg.db_2.14.0 RSQLite_1.0.0 DBI_0.3.1 AnnotationDbi_1.26.1
[6] GenomeInfoDb_1.0.2 hgu133plus2cdf_2.14.0 simpleaffy_2.40.0 gcrma_2.36.0 genefilter_1.46.1
[11] affy_1.42.3 limma_3.20.9 GEOquery_2.30.1 Biobase_2.24.0 BiocGenerics_0.10.0
loaded via a namespace (and not attached):
[1] affyio_1.32.0 annotate_1.42.1 BiocInstaller_1.14.3 Biostrings_2.32.1 IRanges_1.22.10
[6] preprocessCore_1.26.1 RCurl_1.95-4.3 splines_3.1.2 stats4_3.1.2 survival_2.37-7
[11] tools_3.1.2 XML_3.98-1.1 xtable_1.7-4 XVector_0.4.0 zlibbioc_1.10.0
At work I also use R 3.1.2 with all packages up to date, but a Windows-32bit machine. (I just checked earlier today and reran the script after updating...) Sessioninfo will follow in the morning.
Anyone got a clue what could be wrong?
You seem to have a typo, there is no
fit2.m
object. Do you meanfit2b
?you're right, didn't change the results though.
The data comes from this paper:
http://onlinelibrary.wiley.com/doi/10.1111/j.1524-475X.2012.00831.x/abstract
and has the GEO-Accession number: GSE28914
The examplatory Gene which I am using as a "benchmark" is MMP1, which is upregulated ~700fold ond day3 and ~500fold on day7 (Table 3)
This is the approach I used, I left the quality control part out:
- wound_data is the Data from the celfiles imported with read.affy, since i can't use the GEOquery package atm, I will add the import from GEO later so everyone can run it
MMP1 is the gene with the ProbeID 204475_at and I get a logFC of -8.75948 -9.220335 0.6801707 for d3, d7 and d0 respectively.
mheiser1 your code looks ok overall. If you add the import, it will help to run it.
thank you for checking, I updated the original post
How many probes are targeting MMP1 gene?
I did not get your question, what exactly is the confusion?
The paper states that MMP1 is upregulated by many fold in treatment vs. the control. This is not what the OP observed (MMP1 was found to be downregulated in treatment) when he/she ran the above code.
I just saw the method
Their results are expressed as means ± SEM. Statistical analysis was performed, using GeneSpring GX software.
But still results should not go another way around.
What I can say is, MMP1 is highly expressed in epithelial cells (RPM ~ 100-300) including skin, fibroblast or HUVEC cells.
Codes looks okay. If this is the only Probe mapped to MMP1 then OP should probably contact the authors of the paper.
Yes, I saw they were using a different normalizing approach (normalize over median), but this can not explain me getting opposite results. Hence my confusion.
MMP1 isn't the only one which gives unexpected results. If I "twist" the contrasts to get their numbers, Interleukin-1b for example is down regulated (among other inflammatory markers like IL-6), which is highly unlikely...
Guess I'll write them an email, see if they can shed some light on that, thanks for everyones input! I'll share their answer.
Did they take patient into account? I imagine that that could drastically change things. BTW, you can drop the contrast if you just make
intact
the base level off
.When you use Treated-Control, it should be FC = -891.4438 and for Control-Treated, FC = 891.4438. It is basically the same thing, only the direction changes. But you should always use Treated-Control which tells you if it is upregulated/downregulated in treatment with respect to the control. Isn't the aim always to find what changes happen in abnormal tissue versus normal tissue? So, in your case it is downregulated in your treatment samples by a FC of 891.4438.
I don't think the fold change can be negative...more like closer to zero. So I guess you agree with me that it should be downregulated according to my results? I'm going to post the full code (the dataset is in the GEO database) later, I kind of hope I made a silly mistake somewhere and not the authors of the paper...
What I meant to say is the last line of my comment. That particular gene is downregulated by a fold change of 891.44 in treatment samples as compared to the normals.
Sorry for the delay, had a couple other things in between. So I ran the following script at home..It will download the GEOdataset (~115MB), untar it and create a annotation.txt in a new folder:
Now what really confuses me: I get the correct results as described in the paper! My work machine, running the same script however spits out different results... The only differences were:
my work machine is blocked from using GEOquery, so I had to download the GEOdataset manually (checked this at home)
the order of the colnames(design) at home is different than at work, so I adjusted the script accordingly
This is my sessionInfo at home:
At work I also use R 3.1.2 with all packages up to date, but a Windows-32bit machine. (I just checked earlier today and reran the script after updating...) Sessioninfo will follow in the morning.
Anyone got a clue what could be wrong?
The good thing is at least now your results conform with those of the authors. Even with different versions of R packages or a different OS, you shouldn't be getting exactly opposite results, right? I am pretty sure something must be off with the code you ran previously.
I ran your script and got these results (Sure enough you are getting the same results):
And by the way I am using a very old version of R (a lot of my packages are not updated as well) on my personal laptop:
Okay, thank you for checking it on your machine again. I again checked that I named the columns of the design matrix correctly...
I just now decided to check my annotation file again and saw that I named the POD3 and POD7 samples 3POD and 7POD. If I change that in the Annotation file, everything works! I guess the number in the string somehow messes up the whole order of the factor?
Maybe your column names are reversed, please check it to make sure
The name in my Annotation file was (under Target) 3POD/7POD instead of POD3/POD7. After I changed that I get the correct results.