As the title suggests, I'm wondering what the best way is to test each gene in a count matrix containing two groups, where one group is expected to show a global increase in gene expression.
I need to use spike-in normalized RPKM data, so from my understanding of DESeq, it would not be appropriate for this analysis because it requires raw reads to perform its own TMM normalization before building its model to test for differential expression. The reason I need to avoid library normalization after spike-in normalization is, again, that the data are expected to show a global increase/decrease in gene expression after treatment. If the samples are normalized for library size again after accounting for the spike-in, those differences would be masked.
I've tried edgeR's exactTest() and glmLRT() functions with either the design ~0 + group (excluding the intercept) or ~ group (including the intercept, with the baseline "control" as reference). However, no combination above outputs any statistically significant DEGs, with the exception of ~0 + group with glmFit() (example below), and even that outputs only 15 DEGs with P value < 1.00 (let alone anything more stringent)! I would expect many more genes to be differentially expressed, and more variation in P values: all 15 show a P value of 0, while the other 10,000+ genes have a P value of 1.0.
I need some help figuring out either 1) where I may be going wrong in my setup/design, or 2) whether there is another, more appropriate tool for this kind of analysis. Maybe there isn't, and simply running a Student's t-test on each gene and correcting for multiple hypothesis testing is the answer?
Would really appreciate any and all input on this topic!
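For reference, the per-gene t-test fallback I mentioned would look something like this (a minimal sketch on simulated data; `mat` is just a stand-in for the spike-in normalized matrix):

```r
# Minimal sketch of a per-gene Welch t-test with BH correction.
# `mat` stands in for the spike-in normalized matrix (rows = genes).
set.seed(1)
mat <- matrix(rpois(60, 50), nrow = 10)
colnames(mat) <- c(paste0("treatment_rep", 1:3), paste0("control_rep", 1:3))
# Welch t-test per gene; NA for genes where the test cannot be computed
pvals <- apply(mat, 1, function(x) {
  tryCatch(t.test(x[1:3], x[4:6])$p.value, error = function(e) NA)
})
# Benjamini-Hochberg correction for multiple testing
padj <- p.adjust(pvals, method = "BH")
```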
EXAMPLE:
I've been testing these methods with a mock count matrix derived from a single real sample's transcriptome counts: I generated control and treatment replicates from it, with treatment counts roughly 4x the control counts.
head count_matrix.tsv
gene_id treatment_rep1 treatment_rep2 treatment_rep3 control_rep1 control_rep2 control_rep3
ENSMUST00000082387.1,mt-Tf,chrM,0,68,+ 1.0157 1.0461 1.0030 0.253918 0.2844 0.2412
ENSMUST00000082388.1,mt-Rnr1,chrM,69,1024,+ 2048.5400 2109.9962 2022.9333 512.135 573.5912 486.5282
ENSMUST00000082389.1,mt-Tv,chrM,1024,1093,+ 0.0000 0.0000 0.0000 0 0.0000 0.0000
ENSMUST00000082390.1,mt-Rnr2,chrM,1093,2675,+ 2024.6480 2085.3874 1999.3399 506.162 566.9014 480.8539
ENSMUST00000082391.1,mt-Tl1,chrM,2675,2750,+ 5.5253 5.6910 5.4562 1.38132 1.5471 1.3123
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ 25231.6800 25988.6304 24916.2840 6307.92 7064.8704 5992.5240
ENSMUST00000082393.1,mt-Ti,chrM,3705,3774,+ 0.0000 0.0000 0.0000 0 0.0000 0.0000
ENSMUST00000082394.1,mt-Tq,chrM,3771,3842,- 51.5560 53.1027 50.9115 12.889 14.4357 12.2445
ENSMUST00000082395.1,mt-Tm,chrM,3844,3913,+ 5.0048 5.1549 4.9422 1.25119 1.4013 1.1886
Then in R, import the data, set up groups and design, estimate dispersion, fit a model to the data, and run the LRT test:
library(edgeR)
d <- read.table('count_matrix.tsv',header=T,sep='\t', row.names=1)
head(d)
treatment_rep1 treatment_rep2
ENSMUST00000082387.1,mt-Tf,chrM,0,68,+ 1.0157 1.0461
ENSMUST00000082388.1,mt-Rnr1,chrM,69,1024,+ 2048.5400 2109.9962
ENSMUST00000082389.1,mt-Tv,chrM,1024,1093,+ 0.0000 0.0000
ENSMUST00000082390.1,mt-Rnr2,chrM,1093,2675,+ 2024.6480 2085.3874
ENSMUST00000082391.1,mt-Tl1,chrM,2675,2750,+ 5.5253 5.6910
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ 25231.6800 25988.6304
treatment_rep3 control_rep1
ENSMUST00000082387.1,mt-Tf,chrM,0,68,+ 1.0030 0.253918
ENSMUST00000082388.1,mt-Rnr1,chrM,69,1024,+ 2022.9333 512.135000
ENSMUST00000082389.1,mt-Tv,chrM,1024,1093,+ 0.0000 0.000000
ENSMUST00000082390.1,mt-Rnr2,chrM,1093,2675,+ 1999.3399 506.162000
ENSMUST00000082391.1,mt-Tl1,chrM,2675,2750,+ 5.4562 1.381320
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ 24916.2840 6307.920000
control_rep2 control_rep3
ENSMUST00000082387.1,mt-Tf,chrM,0,68,+ 0.2844 0.2412
ENSMUST00000082388.1,mt-Rnr1,chrM,69,1024,+ 573.5912 486.5282
ENSMUST00000082389.1,mt-Tv,chrM,1024,1093,+ 0.0000 0.0000
ENSMUST00000082390.1,mt-Rnr2,chrM,1093,2675,+ 566.9014 480.8539
ENSMUST00000082391.1,mt-Tl1,chrM,2675,2750,+ 1.5471 1.3123
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ 7064.8704 5992.5240
# check row count
nrow(d)
[1] 47285
# make group array
group <- factor(c("treatment","treatment","treatment","control","control","control"))
# relevel to make "control" the baseline reference (relevant for "~ group"; otherwise the alphabetically first level is used as reference)
group <- relevel(group, ref="control")
# make DGE object
dge <- DGEList(counts = d, group=group)
# remove low count genes
keep <- filterByExpr(dge)
y <- dge[keep,,keep.lib.sizes=FALSE]
# check row count
nrow(y)
[1] 10396
# make the experimental design
design <- model.matrix(~0 + group)
# use the estimateDisp function to estimate dispersion
y <- estimateDisp(y, design)
# fit a negative binomial model to the data using the estimated dispersion
fit <- glmFit(y, design)
# the LRT tests whether there is evidence for differences in gene expression between groups
lrt <- glmLRT(fit)
# I DO get this warning message; I suspect it's not critical compared to my main issue, but please let me know if it might be the root of it
Warning message:
Zero sample variances detected, have been offset away from zero
# examine the results
nrow(topTags(lrt, n = 1000000, adjust.method = "BH", sort.by = "logFC", p.value = 1))
[1] 10396
nrow(topTags(lrt, n = 1000000, adjust.method = "BH", sort.by = "logFC", p.value = 0.01))
[1] 15
topTags(lrt, n = 1000000, adjust.method = "BH", sort.by = "logFC", p.value = 0.01)
logFC logCPM LR
NM_009983,Ctsd,chr7,142375909,142387827,- -7.634901 12.29709 4256063
NM_018853,Rplp1,chr9,61913282,61914510,- -7.633497 12.29849 5402128
NM_021278,Tmsb4x,chrX,167207093,167209218,- -7.541610 12.39035 82863006
NM_001130180,Tnnt2,chr1,135836333,135852268,+ -7.452161 12.47978 163050533
NM_001289785,Cryab,chr9,50752970,50756635,+ -7.310952 12.62095 299950423
NM_010861,Myl2,chr5,122100950,122106854,+ -7.102323 12.82953 527672659
NM_009608,Actc1,chr2,114047283,114052875,- -7.048312 12.88353 592028784
NM_009696,Apoe,chr7,19696243,19699188,- -6.969890 12.96194 689728337
NM_010240,Ftl1,chr7,45457943,45459886,- -6.927231 13.00459 745072485
NM_001164248,Tpm1,chr9,67027889,67049213,- -6.905155 13.02666 774340353
NM_010239,Fth1,chr19,9982645,9985111,+ -6.795993 13.13581 925594488
NM_008725,Nppa,chr4,148000721,148002074,+ -6.557433 13.37433 1297094908
ENSMUST00000082421.1,mt-Cytb,chrM,14144,15288,+ -6.446244 13.48551 1491403309
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ -6.315453 13.61628 1739006348
ENSMUST00000082402.1,mt-Co1,chrM,5327,6872,+ -5.833678 14.09801 2857628872
PValue FDR
NM_009983,Ctsd,chr7,142375909,142387827,- 0 0
NM_018853,Rplp1,chr9,61913282,61914510,- 0 0
NM_021278,Tmsb4x,chrX,167207093,167209218,- 0 0
NM_001130180,Tnnt2,chr1,135836333,135852268,+ 0 0
NM_001289785,Cryab,chr9,50752970,50756635,+ 0 0
NM_010861,Myl2,chr5,122100950,122106854,+ 0 0
NM_009608,Actc1,chr2,114047283,114052875,- 0 0
NM_009696,Apoe,chr7,19696243,19699188,- 0 0
NM_010240,Ftl1,chr7,45457943,45459886,- 0 0
NM_001164248,Tpm1,chr9,67027889,67049213,- 0 0
NM_010239,Fth1,chr19,9982645,9985111,+ 0 0
NM_008725,Nppa,chr4,148000721,148002074,+ 0 0
ENSMUST00000082421.1,mt-Cytb,chrM,14144,15288,+ 0 0
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ 0 0
ENSMUST00000082402.1,mt-Co1,chrM,5327,6872,+ 0 0
Please let me know if there's anything else I may have not considered and left out.
Thanks for your time in advance!
Robert
Let's take it step by step, because there is a lot of inaccuracy in your post.
DESeq (or DESeq2? which do you mean?) does not use TMM normalization; edgeR does. For both tools you could theoretically skip the normalization step. And for both tools RPKM is inappropriate, as this normalization distorts the so-called mean-variance trend from which the tools derive their power when analysing experiments with low n, such as your 3 vs 3.
Those are the same; for an intercept-less design you just have to use contrasts, and the results are precisely the same if done properly.
Here comes the problem. First of all, if you know that there is a global shift, why even do a gene-wise differential analysis? The goal of DESeq2/edgeR is to inform about differential expression of individual genes, not to assess global shifts. What the tools give you is an assessment of differential expression relative to the baseline you normalize to. I don't really see the point here.
If you want to show global shifts, you probably do not even need such an analysis; you could just use the spike-in normalized data and show some sort of plot that visualizes the mean of all genes in condition 1 vs condition 2. That will tell you whether there's a global shift.
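A minimal sketch of such a plot, on simulated data with a 4x global shift (`norm` is just a stand-in for the spike-in normalized matrix):

```r
# Per-gene mean in treatment vs control on the log scale; with a pure
# 4x global shift, points fall on a line ~2 log2-units above y = x.
set.seed(1)
ctrl <- matrix(rpois(300, 20), nrow = 100)
norm <- cbind(ctrl * 4, ctrl)  # treatment_rep1-3, then control_rep1-3
m_treat <- rowMeans(norm[, 1:3])
m_ctrl  <- rowMeans(norm[, 4:6])
plot(log2(m_ctrl + 1), log2(m_treat + 1), pch = 16, cex = 0.3,
     xlab = "log2 mean (control)", ylab = "log2 mean (treatment)")
abline(0, 1, col = "red")  # points above the line indicate a global increase
```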
This part is completely wrong. With an intercept-less design and without using contrasts, you're just testing whether expression is zero or not, which for most genes is trivially decided, hence the large logFCs and the p-values of 0. Read the user guide on how to use contrasts.
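Continuing from the code in the question, the contrast version would look roughly like this (a sketch, assuming the `fit` and `design` objects from above; not run here):

```r
# With ~0 + group, the coefficients are per-group means, so test their
# difference explicitly rather than the default single-coefficient test.
colnames(design)  # should be "groupcontrol" "grouptreatment"
con <- makeContrasts(grouptreatment - groupcontrol, levels = design)
lrt <- glmLRT(fit, contrast = con)
topTags(lrt)
```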
I do not really know what to advise here. What exactly is the analysis goal, i.e. which biological question do you want to answer?
Thanks Alexander for your comment, I appreciate your engagement on the topic.
I must have mixed up those normalization methods, thanks for the correction. At any rate, for edgeR I'm not running the normLibSizes() function on the DGEList, so that normalization is not being applied. As for DESeq/2, I didn't see a parameter in the manual for explicitly disabling its internal normalization (it's possible I overlooked it, so please let me know if there is one).
OK, with regard to the designs being essentially the same if done properly: this is what I suspected, given the default behavior of using the alphabetically first factor level as the baseline reference (but then you need to apply makeContrasts()).
Thank you for clarifying the behavior of using ~0 + group without contrasts; that makes sense now.
Essentially, the lab is asking for a list of DEGs from an experiment where global expression is expected to increase with some treatment vs control. Would PCA or some other clustering method be better for this, since it takes all genes per sample into account, and group clusters could then be compared visually or via some distance metric? Or would the sample compositions be essentially the same? If I recall correctly, clustering typically uses library-normalized data as well, so there may be little variation if only the magnitude of expression (of the raw data) differs between sample groups.
Do you have any other thoughts or suggestions for how to approach this data?
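For what it's worth, the PCA idea I'm describing can be sketched like this (simulated data; `norm` is a stand-in for the spike-in normalized matrix):

```r
# PCA on log2-transformed samples: with a pure 4x global shift, the two
# groups separate along a PC while per-sample composition is preserved.
set.seed(1)
ctrl <- matrix(rpois(600, 20), nrow = 100)
norm <- cbind(ctrl * 4, ctrl)  # treatment_rep1-3, then control_rep1-3
pca <- prcomp(t(log2(norm + 1)))
pca$x[, 1:2]  # sample coordinates on PC1/PC2, for plotting or distances
```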
To add to my initial response to your comment: the objective is to return a list of genes with associated log2FC and a statistical significance metric, from an experiment where global expression is expected to increase with some treatment vs control.
Given the example data above, how would you go about this? Would you simply use the spike-in adjusted raw data in edgeR or DESeq, skipping the normalization step, with the appropriate contrasts?