Question: EdgeR differential gene expression has impossibly low seeming P values and FDRs
0
gravatar for erinboyleanderson
2.6 years ago by
erinboyleanderson0 wrote:

Hello,

I've been trying to figure out how to use EdgeR to get differential gene expression. I use Tophat->htseqcount for input. I have 4 different conditions and 2 timepoints and 3 replicates (so a total of 24 samples). Using either glmfit or glmQLFit I get many thousands of signficantly differentially expressed genes between two conditions. The P-values and FDRs seem impossibly small (in the range of 10^-85 small). (Also using cuffdiff on the same dataset, I get roughly 10 differentially expressed genes per condition, which is a very different number of genes. I know that Cuffdiff tends to be more conservative, but surely it isn't THAT much more conservative.

If I do a volcano plot to visualize the data, it looks like this: All the genes are negatively downregulated. I'm pretty sure there is something going on oddly but I'm not sure what it could be. Do I need to correct p-values somehow?

enter image description here

volcano plot comparison... really oddly skewed data

This is the code I used:

library(edgeR)
#import files into R
files<-c('x18hpf_control_s1_htseq.txt','x18hpf_control_s2_htseq.txt','x18hpf_control_s3_htseq.txt','x18hpf_5A_s1_htseq.txt','x18hpf_5A_s2_htseq.txt','x18hpf_5A_s3_htseq.txt','x18hpf_5B_s1_htseq.txt','x18hpf_5B_s2_htseq.txt','x18hpf_5B_s3_htseq.txt','x18hpf_5a5b_s1_htseq.txt','x18hpf_5a5b_s2_htseq.txt','x18hpf_5a5b_s3_htseq.txt','x21hpf_control_s1_htseq.txt','x21hpf_control_s2_htseq.txt','x21hpf_control_s3_htseq.txt','x21hpf_5a_s1_htseq.txt','x21hpf_5a_s2_htseq.txt','x21hpf_5a_s3_htseq.txt','x21hpf_5b_s1_htseq.txt','x21hpf_5b_s2_htseq.txt','x21hpf_5b_s3_htseq.txt','x21hpf_5a5b_s1_htseq.txt','x21hpf_5a5b_s2_htseq.txt','x21hpf_5a5b_s3_htseq.txt')
DB<-readDGE(files,columns=c(1,2))
DB$samples
#make a design matrix
timepoint<-factor(rep(c("t18","t21"),each=12))
groups<-factor(rep(c("wt","tbx5a","tbx5b","double"), each=3,times=2))
combinations<-paste(timepoint,groups,sep=".")
f<-factor(combinations,levels=unique(combinations))
design<-model.matrix(~0+f)
colnames(design)<-levels(f)
design
# Filter and QC -----------------------------------------------------------


#filter out genes with low counts
#use counts per million bs of differences in library sizes
#1 CPM is approx = to 10 in smallest sample (96943)
keep<-rowSums(cpm(DB)>1)>=2
DB<-DB[keep,keep.lib.sizes=FALSE]
DB<-estimateDisp(DB,design)

#So need a GLM model and a design matrix
fit<-glmFit(DB,design)
#wt vs 5a 18 hpf
lrt.18.5a<-glmLRT(fit,coef=2)
tab1<-topTags(lrt.18.5a, n=Inf)
write.csv(tab1,file="18hpf control genes vs 5a.csv")
# and similar for the other conditions

## try a QL F-test which is preferred as it reflects the uncertainty in estimating the dispersion/ is more robust and reliable apparently
fit<-glmQLFit(DB, design)
qlf.18wtvs5a<-glmQLFTest(fit,coef=2)
topTags(qlf.18wtvs5a)
tab1a<-topTags(qlf.18wtvs5a,n=Inf)
write.csv(tab1a,file="qlf 18hpf wt vs 5a.csv")
#no I am still literally finding THOUSANDS of genes
rna-seq R • 1.4k views
ADD COMMENTlink modified 2.6 years ago by Devon Ryan88k • written 2.6 years ago by erinboyleanderson0

Link to image of volcano plot https://postimg.org/image/yev3egsuv/f99b03e7/ because it didn't show up in earlier comment.

ADD REPLYlink written 2.6 years ago by erinboyleanderson0

The web page http://postimg.cc/image/yev3egsuv/ is blocked by the Internet Security Filter because This Websense category is filtered: Suspicious Content.

Can you try another host for the image?

ADD REPLYlink written 2.6 years ago by karl.stamm3.4k

Images can now be seen in the original post (at least I can see them).

ADD REPLYlink written 2.6 years ago by genomax64k
2
gravatar for Devon Ryan
2.6 years ago by
Devon Ryan88k
Freiburg, Germany
Devon Ryan88k wrote:

You're missing a rather important part, namely a contrast. At the moment, you're just testing whether genes have counts greater than 0, which should lead to a very very skewed p-value and fold-change distribution.

ADD COMMENTlink written 2.6 years ago by Devon Ryan88k

Thank you!

So to test for instance if there is a difference between the first set of samples and the second set of samples, I should use the contrast c(1,-1,0,0,0,0,0,0) because it would be testing if there is a significant difference between the control/first sample and the second sample? This does seem to stop the data from being skewed and makes the p-values much more reasonable. There's a high FDR unfortunately, but that seems like it might be because of the data itself rather than coding problems.

lrt.18.5aCONTRAST<-glmLRT(fit,contrast=c(1,-1,0,0,0,0,0,0))
tab1aa<-topTags(lrt.18.5aCONTRAST,n=Inf)
write.csv(tab1aa,file="18hpf wt vs tbx5a contrast.csv")`

This is how my design matrix is set up:

> design
   t18.wt t18.tbx5a t18.tbx5b t18.double t21.wt t21.tbx5a t21.tbx5b t21.double

1       1         0         0          0      0         0         0          0
2       1         0         0          0      0         0         0          0
3       1         0         0          0      0         0         0          0
4       0         1         0          0      0         0         0          0
5       0         1         0          0      0         0         0          0
6       0         1         0          0      0         0         0          0
7       0         0         1          0      0         0         0          0
8       0         0         1          0      0         0         0          0
9       0         0         1          0      0         0         0          0
10      0         0         0          1      0         0         0          0
11      0         0         0          1      0         0         0          0
12      0         0         0          1      0         0         0          0
13      0         0         0          0      1         0         0          0
14      0         0         0          0      1         0         0          0
15      0         0         0          0      1         0         0          0
16      0         0         0          0      0         1         0          0
17      0         0         0          0      0         1         0          0
18      0         0         0          0      0         1         0          0
19      0         0         0          0      0         0         1          0
20      0         0         0          0      0         0         1          0
21      0         0         0          0      0         0         1          0
22      0         0         0          0      0         0         0          1
23      0         0         0          0      0         0         0          1
24      0         0         0          0      0         0         0          1
attr(,"assign")
[1] 1 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$f
[1] "contr.treatment"

improved volcano plot http://imgur.com/a/hulvx

ADD REPLYlink modified 2.6 years ago by genomax64k • written 2.6 years ago by erinboyleanderson0

Yup, that looks like the correct result (sorry the p-values aren't better).

ADD REPLYlink written 2.6 years ago by Devon Ryan88k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1866 users visited in the last hour