Hello,
I've been trying to figure out how to use EdgeR to get differential gene expression. I use Tophat->htseqcount for input. I have 4 different conditions and 2 timepoints and 3 replicates (so a total of 24 samples). Using either glmfit or glmQLFit I get many thousands of signficantly differentially expressed genes between two conditions. The P-values and FDRs seem impossibly small (in the range of 10^-85 small). (Also using cuffdiff on the same dataset, I get roughly 10 differentially expressed genes per condition, which is a very different number of genes. I know that Cuffdiff tends to be more conservative, but surely it isn't THAT much more conservative.
If I do a volcano plot to visualize the data, it looks like this: All the genes are negatively downregulated. I'm pretty sure there is something going on oddly but I'm not sure what it could be. Do I need to correct p-values somehow?
volcano plot comparison... really oddly skewed data
This is the code I used:
library(edgeR)
#import files into R
files<-c('x18hpf_control_s1_htseq.txt','x18hpf_control_s2_htseq.txt','x18hpf_control_s3_htseq.txt','x18hpf_5A_s1_htseq.txt','x18hpf_5A_s2_htseq.txt','x18hpf_5A_s3_htseq.txt','x18hpf_5B_s1_htseq.txt','x18hpf_5B_s2_htseq.txt','x18hpf_5B_s3_htseq.txt','x18hpf_5a5b_s1_htseq.txt','x18hpf_5a5b_s2_htseq.txt','x18hpf_5a5b_s3_htseq.txt','x21hpf_control_s1_htseq.txt','x21hpf_control_s2_htseq.txt','x21hpf_control_s3_htseq.txt','x21hpf_5a_s1_htseq.txt','x21hpf_5a_s2_htseq.txt','x21hpf_5a_s3_htseq.txt','x21hpf_5b_s1_htseq.txt','x21hpf_5b_s2_htseq.txt','x21hpf_5b_s3_htseq.txt','x21hpf_5a5b_s1_htseq.txt','x21hpf_5a5b_s2_htseq.txt','x21hpf_5a5b_s3_htseq.txt')
DB<-readDGE(files,columns=c(1,2))
DB$samples
#make a design matrix
timepoint<-factor(rep(c("t18","t21"),each=12))
groups<-factor(rep(c("wt","tbx5a","tbx5b","double"), each=3,times=2))
combinations<-paste(timepoint,groups,sep=".")
f<-factor(combinations,levels=unique(combinations))
design<-model.matrix(~0+f)
colnames(design)<-levels(f)
design
# Filter and QC -----------------------------------------------------------
#filter out genes with low counts
#use counts per million bs of differences in library sizes
#1 CPM is approx = to 10 in smallest sample (96943)
keep<-rowSums(cpm(DB)>1)>=2
DB<-DB[keep,keep.lib.sizes=FALSE]
DB<-estimateDisp(DB,design)
#So need a GLM model and a design matrix
fit<-glmFit(DB,design)
#wt vs 5a 18 hpf
lrt.18.5a<-glmLRT(fit,coef=2)
tab1<-topTags(lrt.18.5a, n=Inf)
write.csv(tab1,file="18hpf control genes vs 5a.csv")
# and similar for the other conditions
## try a QL F-test which is preferred as it reflects the uncertainty in estimating the dispersion/ is more robust and reliable apparently
fit<-glmQLFit(DB, design)
qlf.18wtvs5a<-glmQLFTest(fit,coef=2)
topTags(qlf.18wtvs5a)
tab1a<-topTags(qlf.18wtvs5a,n=Inf)
write.csv(tab1a,file="qlf 18hpf wt vs 5a.csv")
#no I am still literally finding THOUSANDS of genes
Link to image of volcano plot https://postimg.org/image/yev3egsuv/f99b03e7/ because it didn't show up in earlier comment.
The web page http://postimg.cc/image/yev3egsuv/ is blocked by the Internet Security Filter because This Websense category is filtered: Suspicious Content.
Can you try another host for the image?
Images can now be seen in the original post (at least I can see them).