Differential gene Analysis by Limma
1
3
Entering edit mode
6.2 years ago
Mo ▴ 920

 

I have tired to find Up and down regulated based on Limma in R.

What I have done is as follows: 

library(limma)
fit <- lmFit(data, design)
#  Moderated t-statistic
fit2 <- eBayes(fit)
t<- topTable(fit2)
results <- decideTests(fit2)
summary(results)
vennDiagram(results)
plotMA(fit2, 2)
volcanoplot(fit2)

Now I have the following output 

                    x1            x2    AveExpr        F      P.Value    adj.P.Val
205476_at    1.0335848  2.777778e-06  0.9319210 31.17153 6.807618e-14 1.516533e-09
201110_s_at  1.9916715 -3.703704e-06  1.7957690 29.82112 2.446732e-13 2.725293e-09
212158_at    1.0642415 -3.703704e-06  0.9595617 28.11326 1.239096e-12 9.201115e-09
205239_at    1.5099863  9.259259e-07  1.3614631 26.50244 5.747837e-12 3.201114e-08
212154_at    0.8753344 -4.629630e-06  0.7892355 25.99093 9.364945e-12 4.172458e-08
214211_at    0.9923502 -2.136625e-17  0.8947420 24.90243 2.650229e-11 8.014586e-08
213988_s_at  0.7521288 -9.259259e-07  0.6781488 24.83824 2.818056e-11 8.014586e-08
200748_s_at  0.8140374 -3.703704e-06  0.7339678 24.76978 3.008813e-11 8.014586e-08
201109_s_at  1.4810858  2.777778e-06  1.3354055 24.69309 3.237926e-11 8.014586e-08
202444_s_at -0.7800713 -2.059028e-01 -0.7235957 24.49669 3.907505e-11 8.704748e-08

How can I now find which genes are up and which ones are down regulated ? 

if I perform without my design , I will get something like below 

fit <- lmFit(data)

and the rest as above 

t
                 logFC    AveExpr         t      P.Value    adj.P.Val        B
205476_at    0.9319210  0.9319210  7.479942 1.518325e-13 3.382372e-09 19.96770
201110_s_at  1.7957690  1.7957690  7.317002 4.884719e-13 5.440844e-09 18.82518
212158_at    0.9595617  0.9595617  7.105471 2.153154e-12 1.598861e-08 17.37602
205239_at    1.3614631  1.3614631  6.899900 8.775398e-12 4.887239e-08 16.00487
212154_at    0.7892355  0.7892355  6.833298 1.372724e-11 6.075399e-08 15.56854
202444_s_at -0.7235957 -0.7235957 -6.806994 1.636324e-11 6.075399e-08 15.39729
214211_at    0.8947420  0.8947420  6.689331 3.564056e-11 9.541769e-08 14.63867
213988_s_at  0.6781488  0.6781488  6.680742 3.770674e-11 9.541769e-08 14.58377
200748_s_at  0.7339678  0.7339678  6.671567 4.004358e-11 9.541769e-08 14.52520
201109_s_at  1.3354055  1.3354055  6.661281 4.283238e-11 9.541769e-08 14.45962

I have 1000 controlled and 100 untreated samples , while I have over 30000 of probes. 

I made my design like this 

design <- t(xor(t(replicate(2, c(rep(TRUE, 1000), rep(FALSE, 100)))), c(0, 1))) * 1L 
R Limma Microarray • 13k views
ADD COMMENT
1
Entering edit mode
  1. It's highly unlikely that topTable() with the options you showed produced that output.
  2. You're unlikely to want coef=1.
ADD REPLY
0
Entering edit mode

Yes, I updated the question 

ADD REPLY
0
Entering edit mode

That's still highly unlikely to be the topTable() command that produced that output.

ADD REPLY
0
Entering edit mode

@ Devon Ryan I just repeated the procedure and this is what I got , I am sure of it. why is it surprising ? 

By the way when i try to make 

cont.matrix <- makeContrasts(UnvsCO=untreated-controlled, levels=design)

I get some error  like 

Error in if (levels[1] == "(Intercept)") { : argument is of length zero

ADD REPLY
0
Entering edit mode

topTable() doesn't normally produce a column of F values, but of T values. Also, it labels logFC and whatever the column to the right of that is, which is usually not present.

Regarding the error with makeContrasts(), it'd be rather helpful to know what design looks like...

ADD REPLY
0
Entering edit mode
6.2 years ago
TriS ★ 4.3k

can you post a snapshot of your data? how many samples in each condition? even more importantly, how did you build your design matrix? that will help us helping you tweaking the code

ADD COMMENT
0
Entering edit mode

@TriS ,  @Devon Ryan, Please have a look at above , I updated my question. 

ADD REPLY
2
Entering edit mode

"1000 controlled and 100 untreated samples", do you mean 1000 treated/tumor/disease/whatever and 100 controls?

when you call topTable you should use

topTable(fit2, coef=2)

the coef=2 will tell you which comparisons you are making. in case of your design matrix, if you have 100 controls and 1k tumor/disease/treated/etc.. then you will see logFC > 0 and logFC < 0 for those higher and lower in the control, respectively.

if you want the positive fold changes to reflect higher gene expression in your treated samples, you have to invert the 1s and 0s in the second column of your design matrix, but it all depends which ones are the controls.

also, to extract the genes that are up/down regulated you can do the following:

t <- topTable(fit2, coef=2, n=inf)

results <- t[which(abs(t$logFC) > 1.5 & t$adj.P.Val < 0.05),]

this will create a matrix results containing genes with a log2 FC > +/-1.5 and adjusted p.values (FDR) < 0.05

ADD REPLY
0
Entering edit mode

@TriS , You mean I must not use the design to be able to get the logFC values ? 

If I use coef=2 , I get such error 

topTable(fit2, coef=2)
Error in fit$coefficients[, coef] : subscript out of bounds
ADD REPLY
2
Entering edit mode

you must use the design matrix, otherwise limma doesn't know which ones are your controls and which ones are your treated. looks like the size of your design matrix is off = is not the correct size.

the number of rows in the design matrix should be equal to the number of columns in the data matrix.

can you run dim(your_data_matrix) to see how big it is?

ADD REPLY
0
Entering edit mode

@TriS Thanks for your message! Yes I only want for now to find those up and down regulated genes . However, I get error using your code and I have been searching to solve it but I could not. The error is as follows: 

 t <- topTable(fit2, coef=2, n=inf)
Error in toptable(fit = fit[c("coefficients", "stdev.unscaled")], coef = coef,  : 
  object 'inf' not found​
ADD REPLY
2
Entering edit mode

You want Inf rather than inf.

ADD REPLY
0
Entering edit mode

@ Devon Ryan By setting the Inf, it works. however, my results is empty where could be the problem? 

> results <- t[which(abs(t$logFC) > 1.5 & t$adj.P.Val < 0.05),]
> results
[1] logFC     AveExpr   t         P.Value   adj.P.Val B        
<0 rows> (or 0-length row.names)

I just looked at adj.P Val and I found most of them are higher than 0.05

ADD REPLY
1
Entering edit mode

results are empty because with the line above you keep only values with adj pvalue < 0.05, which are none.

you can use 

results <- t[which(t$P.Value < 0.05),]

this will give you those with p.value (NOT FDR) smaller than 0.05

check which one is the correct column name, if it is t$P.Value or t$P.Val or t$p.value, you want to get the spelling right ;)

you can get the up/down-regulated genes by

up_down_genes <- results[which(abs(results$logFC) > 1),]

that will subset genes based on a log2 fold changes of 1, you can change that value as you wish. negative values = downregulated, positive values = upregulated

ADD REPLY
0
Entering edit mode

I tried to find a cut-off for logFC, however, i want to know your opinion. 

# make the table 
t <- topTable(fit2,n=Inf,coef=2,adjust="fdr");

# to check how many possitive LogFC are in my topTable
apply(t, 2, function(x) length(x[x<0]))
which is 8092
apply(t, 2, function(x) length(x[x>0]))
which is 14185
# to check the range of the data 
range (t[,2])
-0.9089096  1.7957690

 Therefore, I think 0.5 would be a good choice to distinguish up from down regulated, no? if so, I tried to find them as you said 

# find those which have lower p value than 0.05
results <- t[which(t$P.Value < 0.05),]

then 

up_genes <- results[which((results$logFC) > 0.5),]

here I removed the abs to only select the positive ones 

down_genes <- results[which((results$logFC) < 0.5 ),]

Does this make any sense ? if so, I am trying to plot its heatmap as follows:

library(gplots)
idx = which(results$P.Value <0.05)
heatmap.2(as.matrix(data[idx,]),trace='none',scale='row') 
ADD REPLY
0
Entering edit mode

make a volcano plot of your topTable of logFC and p-value and see if you have some genes which are differentially expressed 

ADD REPLY
0
Entering edit mode

@ Manvendra Singh based on volcano plot, you hardly can say something, for example, look at my plot 

volcanoplot(fit2,coef=2,highlight=20, pch=20)

http://tinypic.com/view.php?pic=2e3683b&s=8#.VOTdHEIk_dk

if you know any other way, I will be happy to know

ADD REPLY
1
Entering edit mode

try this

library(ggplot2)

res <- topTable(fit,coef=2,adjust="BH",n=Inf)

# the values below come from those u used already

res[,6] <- ifelse((res$P.Val < 0.05 & abs(res$logFC) > 0.5), "red", "black") 

size <- ifelse((res$P.Val < 0.05 & abs(res$logFC) > 1.5), 4, 2)

##Construct the plot object
g <- ggplot(data=res, aes(x=res[,1], y=-log10(res$P.Val))) +
  geom_point(size=size, colour=res[,6]) +
  xlim(c(-3, 3)) + ylim(c(0,8)) +
  xlab("log2 fold change") + ylab("-log10  p-value") +
  guides(colour = guide_legend(override.aes = list(shape=16)))
g

 

the significant one will have different color and size

ADD REPLY
0
Entering edit mode

@TriS thanks !  Do you know how to put the ID(probes names) on those significant ones ? 

you highlighted those which have P.value smaller than 0.05 and LogFC higher than 0.5 with red color and those which are much higher than 1.5 with a bigger size and of course red. it is handy and beautiful 

http://tinypic.com/view.php?pic=2gtyvs8&s=8#.VOUT-EIk_dk

ADD REPLY
1
Entering edit mode

if you just want to label the big and red ones then you can do it as follow:

lab_array <- ifelse((res$P.Val < 0.05 & abs(res$logFC) > 1.5), rownames(res),"")

g <- ggplot(data=res, aes(x=res[,1], y=-log10(res$P.Val))) +
  geom_point(size=size, colour=res[,6]) +
  xlim(c(-3, 3)) + ylim(c(0,8)) +
  xlab("log2 fold change") + ylab("-log10  p-value") +
  guides(colour = guide_legend(override.aes = list(shape=16)))
  geom_text(aes(label=lab_array),hjust=0, vjust=0)
g

the geom_text() function allows to do that

remember, we are happy to help but Google is also a good friend :)

 

ADD REPLY
0
Entering edit mode
you can go this way too

fit2<-contrasts.fit(fit, cont.matrix)
ebfit<-eBayes(fit2)

topTable(ebfit, number=Inf, adjust.method="none", coef=1)
ADD REPLY
2
Entering edit mode

Oh, that design is not going to tell you anything useful. You're just going to get two intercepts. If the first 1000 samples are your experimental group and the last 100 the control group, then:

mm <- model.matrix(~c(rep(1,1000),rep(0,100)))

mm is the model matrix appropriate for lmFit().

ADD REPLY
0
Entering edit mode

Now I did it like this, do you think it is OK now? how then I recognise my up and down regulated ?

> dim(data)
[1] 22277  1098 

cont.matrix <- makeContrasts(UnvsCO=untreated-controlled, levels=mm)
Error in makeContrasts(UnvsCO = untreated - controlled, levels = mm) : 
  The levels must by syntactically valid names in R, see help(make.names).  Non-valid names: c(rep(1, 990), rep(0, 108))
In addition: Warning message:
In makeContrasts(UnvsCO = untreated - controlled, levels = mm) :
  Renaming (Intercept) to Intercept
fit <- lmFit(data, mm)

fit2 <- eBayes(fit)

t<- topTable(fit2, coef=2)

                         logFC      AveExpr        t     P.Value adj.P.Val
AFFX-r2-P1-cre-5_at 0.25433030 -0.027391348 2.790941 0.005346628 0.9936675
AFFX-ThrX-3_at      0.14424968 -0.008132332 2.744752 0.006154509 0.9936675
AFFX-TrpnX-3_at     0.12906212 -0.007293625 2.730567 0.006423832 0.9936675
AFFX-ThrX-M_at      0.10589384 -0.008488616 2.672831 0.007632963 0.9936675
AFFX-r2-P1-cre-3_at 0.14625313 -0.012365756 2.617852 0.008970330 0.9936675
205476_at           1.03358207  0.931921038 2.476302 0.013425076 0.9936675
219608_s_at         0.09955643 -0.006128597 2.451229 0.014392052 0.9936675
201110_s_at         1.99167522  1.795769035 2.422080 0.015593167 0.9936675
212158_at           1.06424522  0.959561658 2.351705 0.018863655 0.9936675
205239_at           1.50998534  1.361463115 2.283328 0.022601553 0.9936675
                            B
AFFX-r2-P1-cre-5_at -2.351052
AFFX-ThrX-3_at      -2.470447
AFFX-TrpnX-3_at     -2.506720
AFFX-ThrX-M_at      -2.652445
AFFX-r2-P1-cre-3_at -2.788357
205476_at           -3.125444
219608_s_at         -3.183222
201110_s_at         -3.249665
212158_at           -3.406837
205239_at           -3.555152
ADD REPLY
0
Entering edit mode

Those would be vastly more useful results. Before, you were asking, "what has an average signal reliably above 0?", which I rather doubt you actually cared about. The fold-changes here are relative to the 1000 control samples. Granted, there's nothing significant, but that's a different matter.

ADD REPLY
0
Entering edit mode

@Devon Ryan thanks for your valuable comment ! I really appreciate your help and @TriS 

So does it mean , my genes are not statistically significant expressed ? thus, I cannot say which one is up and which one is down regulated ? 

ADD REPLY
0
Entering edit mode

No, many of them are significantly expressed, rather the model your using doesn't suggest that there are any differences between the two groups. Of course, whether the model is appropriate to the actual experimental design is a different question. I wouldn't be surprised if you have some other experimental factor that needs to be controlled.

ADD REPLY
0
Entering edit mode

if you just wanna see the genes up and down check my code below. for the "significance" you would like to have FDR (or adjusted p.value) < 0.05. this means that you accept that 5% of your genes could be false positive. you could use 0.1 or 0.2 but I don't take hard conclusions out of it and be aware that 10% or 20% o your data might be false positive. your p.values are relative to the single comparison. each time you do multiple tests (in this case multiple moderated t.tests, 22277 times), you add the chance of creating errors, that's why you correct or multiple testing. however, if you check sum(t$p.value < 0.05) i think you will find very few hits.

also, I'm not sure about the error you got up there...

ADD REPLY
0
Entering edit mode

@TriS is it not strange that I get ZERO?

sum(t$p.value < 0.05)
[1] 0
ADD REPLY
1
Entering edit mode

Given the results you showed 2 hours ago, no, that's expected. I suspect that sum(t$p.value<0.9) is also 0.

ADD REPLY
0
Entering edit mode

@Devon Ryan YES, that is also ZERO, why? where could be the problem ? 

ADD REPLY
0
Entering edit mode

something is wrong here ! Look at the plot I just draw 

results <- decideTests(fit2)
> summary(results)
   (Intercept) c(rep(1, 990), rep(0, 108))
-1           0                           0
0        22277                       22277
1            0                           0
> vennDiagram(results)

 

http://tinypic.com/view.php?pic=11gsot2&s=8#.VORL6ELsefQ

plotMA(fit2, 2)

http://tinypic.com/view.php?pic=n4f575&s=8#.VORLskLsefQ

volcanoplot(fit2)

http://tinypic.com/view.php?pic=mj4ldc&s=8#.VORMBkLsefQ

ADD REPLY
0
Entering edit mode

I rather obviously can't see any plot.

ADD REPLY
0
Entering edit mode

I don't know why I cannot share the images! I will try to work on it a little bit more and then inform you with new findings! 

ADD REPLY
0
Entering edit mode

Biostars doesn't host images or other files for you. Upload them somewhere and link to them.

ADD REPLY
0
Entering edit mode

@Devon Ryan I have uploaded them on a website and you can see them above! 

It is funny when i use the design like 

design <- t(xor(t(replicate(2, c(rep(TRUE, 1000), rep(FALSE, 100)))), c(0, 1))) * 1L 

I get figures more logical ! 

ADD REPLY
0
Entering edit mode

With your strange xor-based design, you're calculating two intercepts. You're strongly advised to chat with a local statistician, who can more easily go over the details of your experiment and show you visually what your xor-based design is actually doing (not what you want). Some of the images you posted do look strange, so I suspect there are other issues with how the experiment is being analysed. Given that you seem to be more or less completely new at this, it'll be vastly more efficient to have someone local help you further.
 

ADD REPLY
0
Entering edit mode

@Devon Ryan After reading Limma Tutorial and other posts on Booster, I definitely understand I must use model.matrix and I really appreciate your comments. There should be other problem with the data which I must figure out. Therefore, I repeat all steps I have done for the data and see whether the results change or not !   

ADD REPLY
0
Entering edit mode

There isn't necessarily a problem. Perhaps there are simply no differentially expressed probes. Alternatively, perhaps the normalization wasn't performed correctly or the design wasn't sufficient to describe your data. The simplest way to find out would be to contact someone locally knowledgeable in bioinformatics. Playing with the data is vastly more efficient than getting help over the 'net.

ADD REPLY

Login before adding your answer.

Traffic: 1332 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6