Hi all,
I'm another wet-lab biologist who is diving into RNA-seq analysis in R. After a horrific learning curve I'm now getting better, but I need your help. I have RNA-seq data for bacteria-infected host cells at four time-points (2, 4, 6, 8 hpi), with uninfected controls at each time-point. I want DE genes between infected and uninfected samples at each time-point, but don't really care about comparisons between time-points. The problem is that I can only get 1-2 DEG, which I know is not correct. I suspect my problem is with my design matrix, but am not sure. I am wondering if someone could have a quick look at my (summarized) pipeline below and nudge me in the right direction?
Many thanks for your help.
# Import count matrix (after filtering of dodgy and low-read samples)
>    counts <- read.csv(file = "counts.csv", sep = ",", header = TRUE, row.names = 1, stringsAsFactors = FALSE)
# Import meta table
>    meta <- read.csv(file = "meta.csv", header = TRUE, stringsAsFactors = FALSE)
>    meta
sampleName            treatment            time
2hpi_Infection_01    Infection_2            2hpi
2hpi_Infection_02    Infection_2            2hpi
2hpi_Infection_03    Infection_2            2hpi
2hpi_Mock_01        Mock_Infection_2    2hpi
2hpi_Mock_02        Mock_Infection_2    2hpi
2hpi_Mock_03        Mock_Infection_2    2hpi
4hpi_Infection_01    Infection_4            4hpi
4hpi_Infection_02    Infection_4            4hpi
4hpi_Infection_03    Infection_4            4hpi
4hpi_Mock_01        Mock_Infection_4    4hpi
4hpi_Mock_02        Mock_Infection_4    4hpi
4hpi_Mock_03        Mock_Infection_4    4hpi
6hpi_Infection_01    Infection_6            6hpi
6hpi_Infection_02    Infection_6            6hpi
6hpi_Infection_03    Infection_6            6hpi
6hpi_Mock_01        Mock_Infection_6    6hpi
6hpi_Mock_02        Mock_Infection_6    6hpi
6hpi_Mock_03        Mock_Infection_6    6hpi
8hpi_Infection_1    Infection_8            8hpi
8hpi_Infection_2    Infection_8            8hpi
8hpi_Infection_3    Infection_8            8hpi
8hpi_Mock_01        Mock_Infection_8    8hpi
8hpi_Mock_02        Mock_Infection_8    8hpi
8hpi_Mock_03        Mock_Infection_8    8hpi
# Create DGEList and perform TMM normalisation
>    group <- as.factor(meta$treatment)
>    dge <- DGEList(counts, group = group)
>    dge$meta <- meta
>    dge <- calcNormFactors(dge)
# Create design matrix and contrasts. I'm not interested in comparisions between time-points so I
# separated everything out to make the comparisions simpler... (see below).
>    design <- model.matrix(~ group + 0)
>    rownames(design) <- colnames(counts)
>    contrasts <- makeContrasts(Infected_vs_Mock_2 = groupInfection_2 - groupMock_Infection_2, Infected_vs_Mock_4 = groupCInfection_4 - groupMock_Infection_4, Infected_vs_Mock_6 = groupInfection_6 - groupMock_Infection_6, Infected_vs_Mock_8 = groupInfection_8 - groupMock_Infection_8, levels = design)
# LIMMA
>    y <- voomWithQualityWeights(counts = dge, design = design, normalize.method = "cyclicloess", plot = TRUE)
>    fit <- lmFit(y, design)
>    fit <- contrasts.fit(fit, contrasts)
>    fit <- eBayes(fit)
>    summary(decideTests(fit))
# Print list of differentially expressed genes
>    top_2 <- topTable(fit, coef = "Infected_vs_Mock_2", adjust = "fdr", sort.by = "P", lfc = 2, p.value = 0.05, number = Inf)
>    top_4 <- topTable(fit, coef = "Infected_vs_Mock_4", adjust = "fdr", sort.by = "P", lfc = 2, p.value = 0.05, number = Inf)
>    top_6 <- topTable(fit, coef = "Infected_vs_Mock_6", adjust = "fdr", sort.by = "P", lfc = 2, p.value = 0.05, number = Inf)
>    top_8 <- topTable(fit, coef = "Infected_vs_Mock_8", adjust = "fdr", sort.by = "P", lfc = 2, p.value = 0.05, number = Inf)
Thanks again,
Michelle
Your problem is probably the "lfc=2" option. You're asking for a minimum 4x fold change, which is rather large. Just don' use that option. Also, you can also use an adjusted p-value threshold of 0.1, these tests are often a bit conservative given downstream follow ups.
Ah I see - makes sense. Thanks Devon.
I probably wouldn't do the analysis the way you've designed it; I'd be more interested in differing trends over time than differential expression at specific time points (this would require modification of your design though). You will be missing modest changes that are reproducible at different time points etc.
Nonetheless, given the design in it's current form, (in addition to Devon's suggestions) you should have a look at using TopTableF instead of TopTable. This will give you those genes (transcripts?) that have some evidence for differential expression across all of the time points.