I've been trying to figure out how to use EdgeR to get differential gene expression. I use Tophat->htseqcount for input. I have 4 different conditions and 2 timepoints and 3 replicates (so a total of 24 samples). Using either glmfit or glmQLFit I get many thousands of signficantly differentially expressed genes between two conditions. The P-values and FDRs seem impossibly small (in the range of 10^-85 small). (Also using cuffdiff on the same dataset, I get roughly 10 differentially expressed genes per condition, which is a very different number of genes. I know that Cuffdiff tends to be more conservative, but surely it isn't THAT much more conservative.
If I do a volcano plot to visualize the data, it looks like this: All the genes are negatively downregulated. I'm pretty sure there is something going on oddly but I'm not sure what it could be. Do I need to correct p-values somehow?
This is the code I used:
library(edgeR) #import files into R files<-c('x18hpf_control_s1_htseq.txt','x18hpf_control_s2_htseq.txt','x18hpf_control_s3_htseq.txt','x18hpf_5A_s1_htseq.txt','x18hpf_5A_s2_htseq.txt','x18hpf_5A_s3_htseq.txt','x18hpf_5B_s1_htseq.txt','x18hpf_5B_s2_htseq.txt','x18hpf_5B_s3_htseq.txt','x18hpf_5a5b_s1_htseq.txt','x18hpf_5a5b_s2_htseq.txt','x18hpf_5a5b_s3_htseq.txt','x21hpf_control_s1_htseq.txt','x21hpf_control_s2_htseq.txt','x21hpf_control_s3_htseq.txt','x21hpf_5a_s1_htseq.txt','x21hpf_5a_s2_htseq.txt','x21hpf_5a_s3_htseq.txt','x21hpf_5b_s1_htseq.txt','x21hpf_5b_s2_htseq.txt','x21hpf_5b_s3_htseq.txt','x21hpf_5a5b_s1_htseq.txt','x21hpf_5a5b_s2_htseq.txt','x21hpf_5a5b_s3_htseq.txt') DB<-readDGE(files,columns=c(1,2)) DB$samples #make a design matrix timepoint<-factor(rep(c("t18","t21"),each=12)) groups<-factor(rep(c("wt","tbx5a","tbx5b","double"), each=3,times=2)) combinations<-paste(timepoint,groups,sep=".") f<-factor(combinations,levels=unique(combinations)) design<-model.matrix(~0+f) colnames(design)<-levels(f) design # Filter and QC ----------------------------------------------------------- #filter out genes with low counts #use counts per million bs of differences in library sizes #1 CPM is approx = to 10 in smallest sample (96943) keep<-rowSums(cpm(DB)>1)>=2 DB<-DB[keep,keep.lib.sizes=FALSE] DB<-estimateDisp(DB,design) #So need a GLM model and a design matrix fit<-glmFit(DB,design) #wt vs 5a 18 hpf lrt.18.5a<-glmLRT(fit,coef=2) tab1<-topTags(lrt.18.5a, n=Inf) write.csv(tab1,file="18hpf control genes vs 5a.csv") # and similar for the other conditions ## try a QL F-test which is preferred as it reflects the uncertainty in estimating the dispersion/ is more robust and reliable apparently fit<-glmQLFit(DB, design) qlf.18wtvs5a<-glmQLFTest(fit,coef=2) topTags(qlf.18wtvs5a) tab1a<-topTags(qlf.18wtvs5a,n=Inf) write.csv(tab1a,file="qlf 18hpf wt vs 5a.csv") #no I am still literally finding THOUSANDS of genes