I'd appreciate some feedback on an approach for an experiment.
I am working on an RNA-Seq problem that is exploring RNA-Seq data from samples that vary on three dimensions:
- Sex (M or F)
- Timepoint (1 or 2)
- Genotype (A or B)
My colleagues and I are curious a number of different hypotheses, e.g.:
- (Q1.) What genes are DE because of Genotype? (like a "main effect" in an ANOVA)
- (Q2.) What genes are DE through an interaction of Genotype and Timepoint ?
- (Q3.) What genes are DE through an interaction of Genotype, Timepoint , and Sex?
- (...and so on).
Additionally, this data likely has batch effects I am trying to account for using SVA.
I've gotten a variety of pieces of advice on how to approach this problem:
[Approach 1] Use limma (although be careful how you set up the design matrix)
[Approach 2] Use edgeR
[Approach 3] Use DESeq, but set the design matrix to something like this:
myDESeqObject$SV1 <- svseq$sv[,1] # add in the sole surrogate variable detected by/calculated in SVA to design matrix design(myDESeqObject) <- ~ SV1 + Sex + Timepoint + Genotype + Sex:Timepoint + Timepoint:Genotype + Sex:Genotype + Sex:Genotype:Timepoint # design matrix dds <- DESeq(myDESeqObject) # perform DESeq # look for genes that are DE as a consequence of genotype alone, regardless of influence of other factors (answering Q1) results_genotype <- results(dds, contrast=c("Genotype", "A", "B")) # look for genes that are DE as a consequence of genotype alone, regardless of influence of other factors (answering Q) # (not entirely sure what to put for the specific contrasts - what I put seems like it's just looking at effect of Genotype wrt a single timepoint, so not quite answering Q2?) results_genotype_timepoint <- results(dds, contrast=c("Timepoint:Genotype", "1.A", "1.B"))
[Approach 4] Use DESeq, but create a new variable, 'group', with 8 different factors representing all possible combinations of individual variables: M_1_A, F_1_A, M_1_B, F_2_B, and so forth. Do DESeq as normal, but use the contrast function to ask what genes are DE between two specific groups, e.g. M_1_A, F_1_A.
In this way, you don't directly find the set of genes that satisfy Q1,Q2, or Q3...but to answer Q1, at least, you can look for the genes that are called in all of these analyses (e.g. an intersection of these DE gene lists):
M.1.A vs. M.1.B ("genes affected by Genotype in Males at Timepoint 1")
M.2.A vs. M.2.B ("genes affected by Genotype in Males at Timepoint 2")
F.1.A vs. F.1.B ("genes affected by Genotype in Females at Timepoint 1")
F.2.A vs. F.2.B ("genes affected by Genotype in Females at Timepoint 2")
[Approach 5] Use Approach 4 PLUS repeat analysis in edgeR - ask what genes are in common between the two analyses - focus on that subset when looking for pathway enrichments/biological functions/etc.
The end goal of this RNA-Seq experiment is to identify potential biological targets for validation in a wet lab environment, but I would like to do the best possible job with this bioinformatic portion. Which approach above do you think is the best approach, given the questions we would like to answer? Are there any approaches that stand out as obviously wrong? Any feedback to that end is appreciated.
Thanks for taking the time to read!