Question

EdgeR analysis for DE genes with different time courses

0

Entering edit mode

4.4 years ago

ali.hakimzadeh ▴ 10

HI,

I have 20 samples of leishmaniasis vaccine that all of them belong to a low dose, with four different time courses, which each has five replicate. i prepare a count table which contains gene names, and sample names with count numbers. I intend to make the identification of DE genes using a log2 fold change and likelihood ratios (LR) Test in edgeR and significantly expressed genes had an FDR adjusted P-value of < 5%. Since i am new to EdgeR, i want to know if i am following the right path or not? because in the end number of the DE genes that i get is 10.

Thanks in advance

library(edgeR)
setwd("~/Desktop/counts/low")
x_low<-read.csv("counts.txt")
time<-factor(c("pre","2hr-post","24hr-post","14d-post",
           "pre","2hr-post","24hr-post","14d-post",
           "pre","2hr-post","24hr-post","14d-post",
           "pre","2hr-post","24hr-post","14d-post",
           "pre","2hr-post","24hr-post","14d-post"))
time <- relevel(time, ref="pre")
y_low<-DGEList(counts=x_low[,2:21], genes = x_low[,1], group = time) #read counts.csv and make table  
keep <- filterByExpr(y_low)
y_low <-y_low[keep, keep.lib.sizes=FALSE] 
y_low <- calcNormFactors(y_low)
plotMDS(y_low)
data.frame(sample=colnames(y_low),time) #data frame
      sample      time
1   Sample_1       pre
2   Sample_2  2hr-post
3   Sample_3 24hr-post
4   Sample_4  14d-post
5   Sample_5       pre
6   Sample_6  2hr-post
7   Sample_7 24hr-post
8   Sample_8  14d-post
9   Sample_9       pre
10 Sample_10  2hr-post
11 Sample_11 24hr-post
12 Sample_12  14d-post
13 Sample_13       pre
14 Sample_14  2hr-post
15 Sample_15 24hr-post
16 Sample_16  14d-post
17 Sample_17       pre
18 Sample_18  2hr-post
19 Sample_19 24hr-post
20 Sample_20  14d-post
design<-model.matrix(~time) 
y_low<-estimateDisp(y_low,design)
fit<-glmFit(y_low,design)
lrt<-glmLRT(fit)
deg = topTags(lrt,p.value = 0.05)$table

the result :

                genes     logFC      logCPM        LR PValue FDR
29374 ENSG00000227740 -20.78082 -0.06178923  1898.767      0   0
29967 ENSG00000228601 -20.70710  0.01400828  4597.612      0   0
3640  ENSG00000108924 -20.66524  0.04755844  3114.040      0   0
5382  ENSG00000123119 -20.65702  0.10512820  1486.161      0   0
2370  ENSG00000100600 -20.64387  0.21416278  1731.573      0   0
43848 ENSG00000252756 -20.62157  0.07501494  3396.423      0   0
29983 ENSG00000228624 -20.61890  0.10010055  1563.587      0   0
31399 ENSG00000230583 -20.61545  0.08888092  1783.825      0   0
52898 ENSG00000265784 -20.57370  0.24883706 17582.995      0   0
54539 ENSG00000267802 -20.53795  0.11711300  1954.292      0   0

EdgeR RNA-Seq R • 2.7k views

ADD COMMENT • link 4.3 years ago by ali.hakimzadeh ▴ 10

score 2 · Answer 1 · 2019-11-27

2

Entering edit mode

4.4 years ago

Gordon Smyth ★ 7.0k

You have omitted the filtering step (filterByExrp), which is required by all edgeR and limma pipelines.

You have decided to conduct one overall LRT test of all contrasts with glmLRT(fit, coef=2:4) but then you overwrite the result with glmLRT(fit) instead. I wonder if you know which contrast that has tested? It actually tests pre vs 14d-post, as you can see from looking at the levels of the time factor:

> levels(time)
[1] "14d-post"  "24hr-post" "2hr-post"  "pre"

You have set 14d-post as the reference level. By default, glmLRT tests the last coefficient in your model, which is this case is pre vs the reference level.

If you do want to use pre as the baseline, you would find it helpful to set time <- relevel(time, ref="pre")

Having said that, when people don't get DE genes, when they think they should, it is almost always because the model is mis-specified or the data contains outlier samples or batch effects that have not been accounted for. edgeR is designed in a modular way so that you can do quality assessment on the way, using plots like plotMDS, plotBCV, plotMD and so on along the way. You should not expect to simply plug complex datasets into a standard code pipeline without looking at the data and without doing any quality assessment. If you plotted your samples with plotMDS it would probably be immediately obvious why there are so few DE genes.

BTW, you have previously asked questions about edgeR on Bioconductor Support. Generally speaking, you will get more detailed help with edgeR on that forum than here on Biostars. The edgeR developers themselves don't guarantee to answer questions here.

ADD COMMENT • link 4.4 years ago by Gordon Smyth ★ 7.0k

0

Entering edit mode

Firstly thanks for your helpful response. I aim to design a test for pre vs. post to get DE genes. i got the filtering part Gordon, but about LRT, i thought that i compare pre vs. post samples when i did lrt<-glmLRT(fit, coef = 2:4), would you please explain more about that. i appreciate that.
About the asking question on platforms, i will follow next time Bioconductor.

ADD REPLY • link 4.4 years ago by ali.hakimzadeh ▴ 10

1

Entering edit mode

There are few issues here, and I have also edited my answer above.

lrt <- glmLRT(fit, coef = 2:4) compares everything to everything, because you only have 4 time-points and hence only 3 df for comparisons.
Your code immediately overwrites the above with lrt <- glmLRT(fit) so you haven't stored the above result anyway. glmLRT(fit) tests something different, as I now explain in my answer above.

ADD REPLY • link 4.4 years ago by Gordon Smyth ★ 7.0k

0

Entering edit mode

Thanks, dear Gordon . sorry for the late response, i catch the concept, and i will follow the commands as you told.

ADD REPLY • link 4.4 years ago by ali.hakimzadeh ▴ 10

0

Entering edit mode

Dear Gordon, for a week i try to tackle the problems of my script as you told, but with changes that you mentioned i neither could not get the DEG or maybe i did not follow the correct pathway. i plot my samples by plotMDS(y_low) as you advised. The results are as you see in the Image. I changed the script above and added the options as you suggested. i would like to know the problem is in the designing of my matrix, or the option that i set in glmLRT is wrong? Thanks in advance

ADD REPLY • link 4.4 years ago by ali.hakimzadeh ▴ 10

0

Entering edit mode

As I predicted in my answer, the MDS plot makes it immediately obvious why you have no DE.

The MDS plot shows a massive batch effect associated with samples 5 to 8 that you haven't accounted for. Obviously that would prevent you finding any DE because the batch effect is so much larger than the effects you are testing for.

It is seems clear from the MDS plot that each group of four samples represents a separate batch. That is an aspect of the experimental design that you need to consider and need to include in the design matrix. The biologists who generated the data should be able to explain to you how the experimental procedure worked. I already nudged you to think about these sort of issues and highlighted the possibility of batch effects in my original answer.

ADD REPLY • link 4.4 years ago by Gordon Smyth ★ 7.0k