I am new to RNA-Seq analysis, and while I have read the EdgeR manual, several publications, and many biostars posts in order to inform my analysis plan, I would like to verify that I am understanding how to analyze the data properly before I begin. [I have limited knowledge of programming in R which is based almost entirely on these aforementioned sources].
We have three strains of animals (A,B,C) which were fed on two different diets (X, Y). Each strain x diet subgroup contains three animals, giving a total of 18 samples with three replicates for each strain x diet group.
We would like to look at the effect diet has on gene expression within each strain, (e.g. A.X vs A.Y) and the differences in gene expression between strains on the same diet (e.g. A.X vs B.X vs C.X)
EDIT:Since this is a two-way ANOVA design should I also include a test to determine if there is an interaction between strain and diet?*.
group <- c("A.X", "A.Y", "B.X", "B.Y", "C.X", "C.Y") y <- DGEList(counts=count, group=group)
My first thought is to test if there are any differences in gene expression between any groups:
design <- model.matrix(~group) colnames(design) <- levels(group) fit <- glmQLFit(y, design) topTags(glmQLFTest(fit, coef=2:6))
And then if there is, to proceed to conducting the pairwise tests:
design2 <- model.matrix(~0+group) colnames(design2)<-levels(group) fit2<-glmQLFit(y,design2) con <- makeContrasts(Adiet = A.X-A.Y, Bdiet = B.X-B.Y, Cdiet = C.X-C.Y, XdietAB = A.X-B.X, YdietAB=A.Y-B.Y, etc) Adiet.results <- topTags(glmQLFTest(fit2, contrast=con[,"Adiet"]))
After I conduct all the pairwise comparisons I would re-adjust the p-values, and add the adjusted p-values to the original results table and filter out the significant genes and place them in their own table:
all.pvals <- cbind(Adiet.results$table$PValue, Bdiet.results$table$PValue ...) all.adjP <- p.adjust(all.pvals, method = "BH") Adiet.results$table$AdjP <- all.adjP[,1] Adiet.keep <- Adiet.results$table$AdjP <= 0.05 AdietDE <- Adiet.results[Adiet.keep,]
EDIT: Interaction between strain and diet:
strain <- factor(rep(c("A","B","C"),times=2)) diet <- factor(rep(c("X", "Y"), times=3)) design3 <- model.matrix(~strain*diet) fit3 <- glmQLFit(y, design3) topTags(glmQLFTest(fit3, coef=5:6))
In addition to asking if this is a reasonable design/analysis plan I had a couple additional questions:
- Within each strain, litter mates were chosen and placed on the two different diets. Is this a situation where blocking would be appropriate and if not, why not, and what would be an appropriate situation to incorporate blocking?
- Is it appropriate to do an ANOVA-like test on this dataset to detect if there are any differences between any of the groups? If I understand ANOVAs correctly, they only accurately detect differences if the data are normally distributed and have equal variances. Do the estimate dispersion steps (not shown but I will do) or any of the other steps prior to the actual testing (glmQLFtest()) account for this issue?
- When I go to correct the p-values for the multiple tests do I also need to re-correct the p-values for the ANOVA-like test to see if there are any differences between any of the groups? Or is that a stand alone test?
Thank you for all of your help, ahhn