As the title suggests, I'm wondering what the best way is to test each gene in a count matrix containing two groups, where one group is expected to show a global increase in gene expression.
I need to use spike-in normalized RPKM data, so from my understanding of DESeq, it would not be appropriate for this analysis because it requires raw reads to perform its own TMM normalization before building its model to test for differential expression. The reason I need to avoid library normalization after spike-in normalization is, again, that the data are expected to show a global increase/decrease in gene expression after treatment. If the samples are normalized for library size again after accounting for the spike-in, those differences would be masked.
I've tried edgeR's exactTest() and glmLRT() functions with either the design ~0 + group (excluding the intercept) or ~ group (including the intercept, with the baseline "control" as reference). However, no combination above outputs any statistically significant DEGs, with the exception of ~0 + group with glmFit() (example below), and even that outputs only 15 DEGs with P value < 1.00 (let alone anything more stringent)! I would expect many more genes to be differentially expressed, and more variation in P values: all 15 show a P value of 0, while the other 10,000+ genes have a P value of 1.0.
I need some help figuring out either 1) where I may be going wrong in my setup/design, or 2) whether there is another, more appropriate tool for this kind of analysis. Maybe there isn't, and simply running a Student's t-test on each gene and correcting for multiple hypothesis testing is the answer?
Would really appreciate any and all input on this topic!
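For reference, the per-gene t-test fallback I mentioned would look something like this (a minimal sketch on simulated data; `mat` is just a stand-in for the spike-in normalized matrix):

```r
# Minimal sketch of a per-gene Welch t-test with BH correction.
# `mat` stands in for the spike-in normalized matrix (rows = genes).
set.seed(1)
mat <- matrix(rpois(60, 50), nrow = 10)
colnames(mat) <- c(paste0("treatment_rep", 1:3), paste0("control_rep", 1:3))
# Welch t-test per gene; NA for genes where the test cannot be computed
pvals <- apply(mat, 1, function(x) {
  tryCatch(t.test(x[1:3], x[4:6])$p.value, error = function(e) NA)
})
# Benjamini-Hochberg correction for multiple testing
padj <- p.adjust(pvals, method = "BH")
```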
EXAMPLE:
I've been testing these methods with a mock count matrix derived from a single real sample's transcriptome counts: I generated control and treatment replicates from it, with treatment counts roughly 4x the control counts.
head count_matrix.tsv
gene_id treatment_rep1 treatment_rep2 treatment_rep3 control_rep1 control_rep2 control_rep3
ENSMUST00000082387.1,mt-Tf,chrM,0,68,+ 1.0157 1.0461 1.0030 0.253918 0.2844 0.2412
ENSMUST00000082388.1,mt-Rnr1,chrM,69,1024,+ 2048.5400 2109.9962 2022.9333 512.135 573.5912 486.5282
ENSMUST00000082389.1,mt-Tv,chrM,1024,1093,+ 0.0000 0.0000 0.0000 0 0.0000 0.0000
ENSMUST00000082390.1,mt-Rnr2,chrM,1093,2675,+ 2024.6480 2085.3874 1999.3399 506.162 566.9014 480.8539
ENSMUST00000082391.1,mt-Tl1,chrM,2675,2750,+ 5.5253 5.6910 5.4562 1.38132 1.5471 1.3123
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ 25231.6800 25988.6304 24916.2840 6307.92 7064.8704 5992.5240
ENSMUST00000082393.1,mt-Ti,chrM,3705,3774,+ 0.0000 0.0000 0.0000 0 0.0000 0.0000
ENSMUST00000082394.1,mt-Tq,chrM,3771,3842,- 51.5560 53.1027 50.9115 12.889 14.4357 12.2445
ENSMUST00000082395.1,mt-Tm,chrM,3844,3913,+ 5.0048 5.1549 4.9422 1.25119 1.4013 1.1886
Then in R, import the data, set up groups and design, estimate dispersion, fit a model to the data, and run the LRT test:
library(edgeR)
d <- read.table('count_matrix.tsv',header=T,sep='\t', row.names=1)
head(d)
treatment_rep1 treatment_rep2
ENSMUST00000082387.1,mt-Tf,chrM,0,68,+ 1.0157 1.0461
ENSMUST00000082388.1,mt-Rnr1,chrM,69,1024,+ 2048.5400 2109.9962
ENSMUST00000082389.1,mt-Tv,chrM,1024,1093,+ 0.0000 0.0000
ENSMUST00000082390.1,mt-Rnr2,chrM,1093,2675,+ 2024.6480 2085.3874
ENSMUST00000082391.1,mt-Tl1,chrM,2675,2750,+ 5.5253 5.6910
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ 25231.6800 25988.6304
treatment_rep3 control_rep1
ENSMUST00000082387.1,mt-Tf,chrM,0,68,+ 1.0030 0.253918
ENSMUST00000082388.1,mt-Rnr1,chrM,69,1024,+ 2022.9333 512.135000
ENSMUST00000082389.1,mt-Tv,chrM,1024,1093,+ 0.0000 0.000000
ENSMUST00000082390.1,mt-Rnr2,chrM,1093,2675,+ 1999.3399 506.162000
ENSMUST00000082391.1,mt-Tl1,chrM,2675,2750,+ 5.4562 1.381320
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ 24916.2840 6307.920000
control_rep2 control_rep3
ENSMUST00000082387.1,mt-Tf,chrM,0,68,+ 0.2844 0.2412
ENSMUST00000082388.1,mt-Rnr1,chrM,69,1024,+ 573.5912 486.5282
ENSMUST00000082389.1,mt-Tv,chrM,1024,1093,+ 0.0000 0.0000
ENSMUST00000082390.1,mt-Rnr2,chrM,1093,2675,+ 566.9014 480.8539
ENSMUST00000082391.1,mt-Tl1,chrM,2675,2750,+ 1.5471 1.3123
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ 7064.8704 5992.5240
# check row count
nrow(d)
[1] 47285
# make group array
group <- factor(c("treatment","treatment","treatment","control","control","control"))
# relevel to make "control" the baseline reference (relevant for "~ group"; otherwise the alphabetically first level is used as reference)
group <- relevel(group, ref="control")
# make DGE object
dge <- DGEList(counts = d, group=group)
# remove low count genes
keep <- filterByExpr(dge)
y <- dge[keep,,keep.lib.sizes=FALSE]
# check row count
nrow(y)
[1] 10396
# make the experimental design
design <- model.matrix(~0 + group)
# use the estimateDisp function to estimate dispersion
y <- estimateDisp(y, design)
# fit a negative binomial model to the data using the estimated dispersion
fit <- glmFit(y, design)
# the LRT tests whether there is evidence for differences in gene expression between groups
lrt <- glmLRT(fit)
# I DO get this warning message; I suspect it's not critical compared to my main issue, but please let me know if it might be the root of it
Warning message:
Zero sample variances detected, have been offset away from zero
# examine the results
nrow(topTags(lrt, n = 1000000, adjust.method = "BH", sort.by = "logFC", p.value = 1))
[1] 10396
nrow(topTags(lrt, n = 1000000, adjust.method = "BH", sort.by = "logFC", p.value = 0.01))
[1] 15
topTags(lrt, n = 1000000, adjust.method = "BH", sort.by = "logFC", p.value = 0.01)
logFC logCPM LR
NM_009983,Ctsd,chr7,142375909,142387827,- -7.634901 12.29709 4256063
NM_018853,Rplp1,chr9,61913282,61914510,- -7.633497 12.29849 5402128
NM_021278,Tmsb4x,chrX,167207093,167209218,- -7.541610 12.39035 82863006
NM_001130180,Tnnt2,chr1,135836333,135852268,+ -7.452161 12.47978 163050533
NM_001289785,Cryab,chr9,50752970,50756635,+ -7.310952 12.62095 299950423
NM_010861,Myl2,chr5,122100950,122106854,+ -7.102323 12.82953 527672659
NM_009608,Actc1,chr2,114047283,114052875,- -7.048312 12.88353 592028784
NM_009696,Apoe,chr7,19696243,19699188,- -6.969890 12.96194 689728337
NM_010240,Ftl1,chr7,45457943,45459886,- -6.927231 13.00459 745072485
NM_001164248,Tpm1,chr9,67027889,67049213,- -6.905155 13.02666 774340353
NM_010239,Fth1,chr19,9982645,9985111,+ -6.795993 13.13581 925594488
NM_008725,Nppa,chr4,148000721,148002074,+ -6.557433 13.37433 1297094908
ENSMUST00000082421.1,mt-Cytb,chrM,14144,15288,+ -6.446244 13.48551 1491403309
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ -6.315453 13.61628 1739006348
ENSMUST00000082402.1,mt-Co1,chrM,5327,6872,+ -5.833678 14.09801 2857628872
PValue FDR
NM_009983,Ctsd,chr7,142375909,142387827,- 0 0
NM_018853,Rplp1,chr9,61913282,61914510,- 0 0
NM_021278,Tmsb4x,chrX,167207093,167209218,- 0 0
NM_001130180,Tnnt2,chr1,135836333,135852268,+ 0 0
NM_001289785,Cryab,chr9,50752970,50756635,+ 0 0
NM_010861,Myl2,chr5,122100950,122106854,+ 0 0
NM_009608,Actc1,chr2,114047283,114052875,- 0 0
NM_009696,Apoe,chr7,19696243,19699188,- 0 0
NM_010240,Ftl1,chr7,45457943,45459886,- 0 0
NM_001164248,Tpm1,chr9,67027889,67049213,- 0 0
NM_010239,Fth1,chr19,9982645,9985111,+ 0 0
NM_008725,Nppa,chr4,148000721,148002074,+ 0 0
ENSMUST00000082421.1,mt-Cytb,chrM,14144,15288,+ 0 0
ENSMUST00000082392.1,mt-Nd1,chrM,2750,3707,+ 0 0
ENSMUST00000082402.1,mt-Co1,chrM,5327,6872,+ 0 0
Please let me know if there's anything else I may have not considered and left out.
Thanks for your time in advance!
Robert
Let's take it step by step, because there is a lot of inaccuracy in your post.
DESeq (or DESeq2? which do you mean?) does not use TMM normalization; edgeR does. For both tools you could theoretically skip the normalization step. And for both tools RPKM is inappropriate, as this normalization distorts the so-called mean-variance trend from which the tools derive their power when analysing experiments with low n, such as your 3 vs 3.
Those are the same; for an intercept-less design you just have to use contrasts, and the results are precisely the same if done properly.
Here comes the problem. First of all, if you know that there is a global shift, why even do a gene-wise differential analysis? The goal of DESeq2/edgeR is to inform about differential expression of individual genes, not to assess global shifts. What the tools give you is an assessment of differential expression relative to the baseline you normalize to. I don't really see the point here.
If you want to show global shifts, you probably do not even need such an analysis; you could just use the spike-in normalized data and show some sort of plot that visualizes the mean of all genes in condition 1 vs condition 2. That will tell you whether there's a global shift.
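A minimal sketch of such a plot, on simulated data with a 4x global shift (`norm` is just a stand-in for the spike-in normalized matrix):

```r
# Per-gene mean in treatment vs control on the log scale; with a pure
# 4x global shift, points fall on a line ~2 log2-units above y = x.
set.seed(1)
ctrl <- matrix(rpois(300, 20), nrow = 100)
norm <- cbind(ctrl * 4, ctrl)  # treatment_rep1-3, then control_rep1-3
m_treat <- rowMeans(norm[, 1:3])
m_ctrl  <- rowMeans(norm[, 4:6])
plot(log2(m_ctrl + 1), log2(m_treat + 1), pch = 16, cex = 0.3,
     xlab = "log2 mean (control)", ylab = "log2 mean (treatment)")
abline(0, 1, col = "red")  # points above the line indicate a global increase
```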
This part is completely wrong. With an intercept-less design and without using contrasts, you're just testing whether expression is zero or not, which for most genes is trivially decided, hence the large logFCs and the p-values of 0. Read the user guide on how to use contrasts.
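Continuing from the code in the question, the contrast version would look roughly like this (a sketch, assuming the `fit` and `design` objects from above; not run here):

```r
# With ~0 + group, the coefficients are per-group means, so test their
# difference explicitly rather than the default single-coefficient test.
colnames(design)  # should be "groupcontrol" "grouptreatment"
con <- makeContrasts(grouptreatment - groupcontrol, levels = design)
lrt <- glmLRT(fit, contrast = con)
topTags(lrt)
```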
I do not really know what to advise here. What exactly is the analysis goal, i.e. which biological question do you want to answer?
Thanks Alexander for your comment, I appreciate your engagement on the topic.
I must have mixed up those normalization methods, thanks for the correction. At any rate, for edgeR I'm not running the normLibSizes() function on the DGEList, so that normalization is not being applied. As for DESeq/2, I didn't see a parameter in the manual for explicitly disabling its internal normalization (it's possible I overlooked it, so please let me know if there is one).
OK, with regard to the designs being essentially the same if done properly: this is what I suspected, given the default behavior of using the alphabetically first factor level as the baseline reference (but then you need to apply makeContrasts()).
Thank you for clarifying the behavior of using ~0 + group without contrasts; that makes sense now.
Essentially, the lab is asking for a list of DEGs from an experiment where global expression is expected to increase with some treatment vs control. Would PCA or some other clustering method be better for this, since it takes all genes per sample into account, and group clusters could then be compared visually or via some distance metric? Or would the sample compositions be essentially the same? If I recall correctly, clustering typically uses library-normalized data as well, so there may be little variation if only the magnitude of expression (of the raw data) differs between sample groups.
Do you have any other thoughts or suggestions for how to approach this data?
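For what it's worth, the PCA idea I'm describing can be sketched like this (simulated data; `norm` is a stand-in for the spike-in normalized matrix):

```r
# PCA on log2-transformed samples: with a pure 4x global shift, the two
# groups separate along a PC while per-sample composition is preserved.
set.seed(1)
ctrl <- matrix(rpois(600, 20), nrow = 100)
norm <- cbind(ctrl * 4, ctrl)  # treatment_rep1-3, then control_rep1-3
pca <- prcomp(t(log2(norm + 1)))
pca$x[, 1:2]  # sample coordinates on PC1/PC2, for plotting or distances
```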
To add to my initial response to your comment: the objective is to return a list of genes with associated log2FC and a statistical significance metric, from an experiment where global expression is expected to increase with some treatment vs control.
Given the example data above, how would you go about this? Would you simply use the spike-in adjusted raw data in edgeR or DESeq, skipping the normalization step, with the appropriate contrasts?