Question

EdgeR differential gene expression has impossibly low seeming P values and FDRs

0

Entering edit mode

7.8 years ago

erinboyleanderson • 0

Hello,

I've been trying to figure out how to use EdgeR to get differential gene expression. I use Tophat->htseqcount for input. I have 4 different conditions and 2 timepoints and 3 replicates (so a total of 24 samples). Using either glmfit or glmQLFit I get many thousands of signficantly differentially expressed genes between two conditions. The P-values and FDRs seem impossibly small (in the range of 10^-85 small). (Also using cuffdiff on the same dataset, I get roughly 10 differentially expressed genes per condition, which is a very different number of genes. I know that Cuffdiff tends to be more conservative, but surely it isn't THAT much more conservative.

If I do a volcano plot to visualize the data, it looks like this: All the genes are negatively downregulated. I'm pretty sure there is something going on oddly but I'm not sure what it could be. Do I need to correct p-values somehow?

enter image description here

volcano plot comparison... really oddly skewed data

This is the code I used:

library(edgeR)
#import files into R
files<-c('x18hpf_control_s1_htseq.txt','x18hpf_control_s2_htseq.txt','x18hpf_control_s3_htseq.txt','x18hpf_5A_s1_htseq.txt','x18hpf_5A_s2_htseq.txt','x18hpf_5A_s3_htseq.txt','x18hpf_5B_s1_htseq.txt','x18hpf_5B_s2_htseq.txt','x18hpf_5B_s3_htseq.txt','x18hpf_5a5b_s1_htseq.txt','x18hpf_5a5b_s2_htseq.txt','x18hpf_5a5b_s3_htseq.txt','x21hpf_control_s1_htseq.txt','x21hpf_control_s2_htseq.txt','x21hpf_control_s3_htseq.txt','x21hpf_5a_s1_htseq.txt','x21hpf_5a_s2_htseq.txt','x21hpf_5a_s3_htseq.txt','x21hpf_5b_s1_htseq.txt','x21hpf_5b_s2_htseq.txt','x21hpf_5b_s3_htseq.txt','x21hpf_5a5b_s1_htseq.txt','x21hpf_5a5b_s2_htseq.txt','x21hpf_5a5b_s3_htseq.txt')
DB<-readDGE(files,columns=c(1,2))
DB$samples
#make a design matrix
timepoint<-factor(rep(c("t18","t21"),each=12))
groups<-factor(rep(c("wt","tbx5a","tbx5b","double"), each=3,times=2))
combinations<-paste(timepoint,groups,sep=".")
f<-factor(combinations,levels=unique(combinations))
design<-model.matrix(~0+f)
colnames(design)<-levels(f)
design
# Filter and QC -----------------------------------------------------------


#filter out genes with low counts
#use counts per million bs of differences in library sizes
#1 CPM is approx = to 10 in smallest sample (96943)
keep<-rowSums(cpm(DB)>1)>=2
DB<-DB[keep,keep.lib.sizes=FALSE]
DB<-estimateDisp(DB,design)

#So need a GLM model and a design matrix
fit<-glmFit(DB,design)
#wt vs 5a 18 hpf
lrt.18.5a<-glmLRT(fit,coef=2)
tab1<-topTags(lrt.18.5a, n=Inf)
write.csv(tab1,file="18hpf control genes vs 5a.csv")
# and similar for the other conditions

## try a QL F-test which is preferred as it reflects the uncertainty in estimating the dispersion/ is more robust and reliable apparently
fit<-glmQLFit(DB, design)
qlf.18wtvs5a<-glmQLFTest(fit,coef=2)
topTags(qlf.18wtvs5a)
tab1a<-topTags(qlf.18wtvs5a,n=Inf)
write.csv(tab1a,file="qlf 18hpf wt vs 5a.csv")
#no I am still literally finding THOUSANDS of genes

RNA-Seq R • 3.2k views

ADD COMMENT • link updated 7.8 years ago by Devon Ryan 104k • written 7.8 years ago by erinboyleanderson • 0

0

Entering edit mode

Link to image of volcano plot https://postimg.org/image/yev3egsuv/f99b03e7/ because it didn't show up in earlier comment.

ADD REPLY • link 7.8 years ago by erinboyleanderson • 0

0

Entering edit mode

The web page http://postimg.cc/image/yev3egsuv/ is blocked by the Internet Security Filter because This Websense category is filtered: Suspicious Content.

Can you try another host for the image?

ADD REPLY • link 7.8 years ago by karl.stamm 4.1k

0

Entering edit mode

Images can now be seen in the original post (at least I can see them).

ADD REPLY • link 7.8 years ago by GenoMax 142k

GenoMax · Answer 1 · 2016-08-01

2

Entering edit mode

7.8 years ago

Devon Ryan 104k

You're missing a rather important part, namely a contrast. At the moment, you're just testing whether genes have counts greater than 0, which should lead to a very very skewed p-value and fold-change distribution.

ADD COMMENT • link 7.8 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you!

So to test for instance if there is a difference between the first set of samples and the second set of samples, I should use the contrast c(1,-1,0,0,0,0,0,0) because it would be testing if there is a significant difference between the control/first sample and the second sample? This does seem to stop the data from being skewed and makes the p-values much more reasonable. There's a high FDR unfortunately, but that seems like it might be because of the data itself rather than coding problems.

lrt.18.5aCONTRAST<-glmLRT(fit,contrast=c(1,-1,0,0,0,0,0,0))
tab1aa<-topTags(lrt.18.5aCONTRAST,n=Inf)
write.csv(tab1aa,file="18hpf wt vs tbx5a contrast.csv")`

This is how my design matrix is set up:

> design
   t18.wt t18.tbx5a t18.tbx5b t18.double t21.wt t21.tbx5a t21.tbx5b t21.double

1       1         0         0          0      0         0         0          0
2       1         0         0          0      0         0         0          0
3       1         0         0          0      0         0         0          0
4       0         1         0          0      0         0         0          0
5       0         1         0          0      0         0         0          0
6       0         1         0          0      0         0         0          0
7       0         0         1          0      0         0         0          0
8       0         0         1          0      0         0         0          0
9       0         0         1          0      0         0         0          0
10      0         0         0          1      0         0         0          0
11      0         0         0          1      0         0         0          0
12      0         0         0          1      0         0         0          0
13      0         0         0          0      1         0         0          0
14      0         0         0          0      1         0         0          0
15      0         0         0          0      1         0         0          0
16      0         0         0          0      0         1         0          0
17      0         0         0          0      0         1         0          0
18      0         0         0          0      0         1         0          0
19      0         0         0          0      0         0         1          0
20      0         0         0          0      0         0         1          0
21      0         0         0          0      0         0         1          0
22      0         0         0          0      0         0         0          1
23      0         0         0          0      0         0         0          1
24      0         0         0          0      0         0         0          1
attr(,"assign")
[1] 1 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$f
[1] "contr.treatment"

improved volcano plot http://imgur.com/a/hulvx

ADD REPLY • link updated 7.8 years ago by GenoMax 142k • written 7.8 years ago by erinboyleanderson • 0

0

Entering edit mode

Yup, that looks like the correct result (sorry the p-values aren't better).

ADD REPLY • link 7.8 years ago by Devon Ryan 104k