Question

Is total miRNA in EV constant? Implications for library size normalization.

1

4 days ago

Thomas.H.Hampton ▴ 10

I often see studies that use tools like DESeq2 or edgeR to normalize read counts across samples when quantifying miRNA expression. Technical artifacts are a problem, and I understand why people studying miRNA in extracellular vesicles (EV) would be tempted to use tools that have been successfully used in other domains. However, I've always found the underlying assumption that mRNA production by cells is constant regardless of treatment a little surprising, and library size adjustment by edgeR and DESeq2 requires that assumption to hold. Maybe cells just churn out a given amount of mRNA no matter what is going on.

It feels less likely to me that cells produce miRNA at a constant rate no matter what, and quite unlikely that cells package a fixed amount of miRNA into EV no matter what. Do others share my suspicion that total miRNA in EV often varies as a function of treatment? If it does, then library size normalization is invalid for this type of data.

rna-seq • 6.0k views

ADD COMMENT • link updated 18 hours ago by Gordon Smyth ★ 8.3k • written 4 days ago by Thomas.H.Hampton ▴ 10

score 4 · Answer 1 · 2025-09-12

4

4 days ago

Gordon Smyth ★ 8.3k

Library size normalization in edgeR does not assume that mRNA production is constant. On the contrary, the edgeR normalization factors can be interpreted as estimating changes in overall mRNA production.

The actual assumption is that most genes (over 50%) are not DE. Consider a toy example with 10 genes. Suppose that nine of the genes have constant expression between conditions but gene 10 is enormously up-regulated in the second condition, so much so that total mRNA production over all genes in the second condition is doubled. If you (wrongly) assumed that mRNA production was constant, you would have to conclude that genes 1-9 are all downregulated in the second condition and only gene 10 is up. edgeR instead will infer that total mRNA production is up in the second condition, and will adjust the library sizes so that genes 1-9 are roughly constant and only gene 10 is changed.
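The toy example above can be worked through numerically. The following is a minimal sketch in plain Python, not edgeR itself: the abundances are made up to match the description (nine constant genes, gene 10 raised enough to double total mRNA), sequencing is idealised with no noise, and a simple median of per-gene ratios stands in for edgeR's actual TMM (trimmed mean of M-values) method.

```python
from statistics import median

# Hypothetical absolute mRNA amounts for 10 genes (arbitrary units).
cond_a = [100.0] * 10
# Genes 1-9 unchanged; gene 10 is up 11-fold, so total mRNA doubles (1000 -> 2000).
cond_b = [100.0] * 9 + [1100.0]


def sequence(abundance, depth=1_000_000):
    """Idealised sequencing: each gene's reads are proportional to its
    share of total RNA, with the same depth in every sample and no noise."""
    total = sum(abundance)
    return [depth * x / total for x in abundance]


counts_a = sequence(cond_a)
counts_b = sequence(cond_b)

# Naive view: equal library sizes, so compare counts directly.
# This is what you get if you assume total mRNA production is constant.
naive_fc = [b / a for a, b in zip(counts_a, counts_b)]

# Assumption that most genes are not DE: the typical (median) ratio should be 1.
# The median ratio is a simplified stand-in for a TMM-style normalization factor.
norm_factor = median(naive_fc)
adjusted_fc = [fc / norm_factor for fc in naive_fc]

print("naive fold changes:   ", naive_fc)
print("normalization factor: ", norm_factor)
print("adjusted fold changes:", adjusted_fc)
```

With these made-up numbers, the naive fold changes are 0.5 for genes 1-9 and 5.5 for gene 10, while the adjusted fold changes are 1 for genes 1-9 and 11 for gene 10. The normalization factor of 0.5 reflects the doubling of total mRNA in the second condition.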

I wonder where you have read that mRNA production is assumed constant. No such statement has ever been made by the edgeR authors. Similar comments would apply to DESeq2.

ADD COMMENT • link 4 days ago by Gordon Smyth ★ 8.3k

0

I just wanted to add that my answer at "Why are we assuming genes are not differentially expressed?" has a toy example illustrating Gordon's answer.

ADD REPLY • link 2 days ago by dariober 15k

0

Thanks very much! As an aside, I'm impressed that I am talking to the Gordon Smyth. I feel like I'm talking to a celebrity!

As I understand it, RNA-seq works something like this. Isolate RNA from a fixed amount of tissue (grams) or cells (count). Assess RNA concentration in each sample, and use this concentration to aliquot a fixed amount of RNA for reverse transcription into DNA. Quantify DNA concentration so that the same amount of DNA from each sample is used for library preparation. Assess final DNA concentration and load the same amount of DNA for sequencing. Adjust read counts so that each sample has about the same number of reads.

As you say, one certainly could use library size correction factors from edgeR as a proxy for biologically interesting differences in global gene expression as a function of experimental design, especially if one takes all the other dilution factors into account. I don't recall seeing dilution factors or the library size factors reported, let alone highlighted. I think as a community we see these as technical embarrassments that need to be swept under the rug ASAP.

I must have read hundreds if not thousands of papers on global gene expression analysis over the last 20 years, and I don't recall one stating that a particular drug or condition increases or decreases expression across the board. I think our implicit understanding of gene expression is closer to the concept of "housekeeping genes", where it was baselessly assumed that there are certain genes whose expression never varies no matter what is going on in a cell. Assuming that fewer than 50% of genes are significantly differentially expressed seems much less problematic than trying to find a perfect reference gene...

But consider this thought experiment. Suppose I discovered a drug that globally increases the efficiency of RNA polymerase II such that exposed cells produce about twice as much messenger RNA from each gene across the board. I perform a standard RNA-seq workflow and report results as they are typically reported. Would my report include this systematic shift? I doubt it.
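The thought experiment can be put in numbers. This is a minimal sketch with made-up abundances for three hypothetical genes, assuming idealised sequencing in which reads are drawn in proportion to each gene's share of the RNA pool: if every gene doubles, the shares do not change, so the read counts alone carry no trace of the shift.

```python
def read_fractions(abundance):
    """Share of the RNA pool (and hence of the reads) taken by each gene."""
    total = sum(abundance)
    return [x / total for x in abundance]


# Hypothetical absolute mRNA amounts per cell for three genes.
control = [50.0, 200.0, 750.0]
# The hypothetical drug doubles output from every gene.
treated = [2 * x for x in control]

frac_control = read_fractions(control)
frac_treated = read_fractions(treated)

# Largest difference in read share between the two conditions.
max_diff = max(abs(c - t) for c, t in zip(frac_control, frac_treated))

print("control shares:", frac_control)
print("treated shares:", frac_treated)
print("largest difference in share:", max_diff)
```

Under these assumptions the two lists of shares are identical, so any analysis based only on relative read counts would report no change.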

ADD REPLY • link 2 days ago by Thomas.H.Hampton ▴ 10

0

Experiments with global changes in the RNA production usually need spike-in normalization to estimate the library sizes. Questions about this scenario have been asked a number of times on the Bioconductor and Biostars forums.

My collaborators use Drosophila spike-in to handle this extreme situation in mRNA analyses, and we then use the ratios of mouse to Drosophila reads to normalize the library sizes. Spike-in normalization is very noisy, however, and it is only good for general conclusions. What do you expect to learn from a DE analysis in such a context anyway? You will just learn that everything is up.
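The idea of using mouse-to-Drosophila read ratios can be sketched as follows. This is only an illustration with made-up read totals and hypothetical names, not the actual pipeline described above; it assumes the same amount of spike-in material was added per cell in every sample, which is the part that has to be controlled carefully in the wet lab.

```python
# Hypothetical total read counts per sample: endogenous (mouse) reads and
# spike-in (Drosophila) reads. All numbers are made up for illustration.
samples = {
    "control": {"mouse": 9_000_000, "fly": 1_000_000},
    "treated": {"mouse": 9_500_000, "fly": 500_000},
}


def mouse_to_fly_ratio(sample):
    """Endogenous reads per spike-in read. With equal spike-in per cell,
    this is proportional to total endogenous RNA per cell."""
    return sample["mouse"] / sample["fly"]


def spikein_scale_factors(samples, reference="control"):
    """Per-sample scaling factors relative to a reference sample, based only
    on spike-in reads. Dividing a sample's mouse counts by its factor puts
    the counts on a common 'per unit of spike-in' scale."""
    ref_fly = samples[reference]["fly"]
    return {name: s["fly"] / ref_fly for name, s in samples.items()}


ratios = {name: mouse_to_fly_ratio(s) for name, s in samples.items()}
factors = spikein_scale_factors(samples)

# Estimated global change in RNA output, treated relative to control.
global_change = ratios["treated"] / ratios["control"]

print("mouse/fly ratios:", ratios)
print("spike-in scale factors:", factors)
print("estimated global fold change:", round(global_change, 2))
```

In this made-up case the treated sample has about 2.1 times as much endogenous RNA per unit of spike-in as the control. Real spike-in read counts are noisy, so an estimate like this supports only general conclusions.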

You seem to be making a lot of assumptions about RNA-seq, which I do not think are correct. I am not a wetlab expert at all, but in a regular mRNA-seq analysis there is no requirement for a fixed amount of tissue or cells, or a fixed amount of RNA, or for exactly the same amount of DNA. Having roughly equal quantities would be ideal, but the downstream analysis does not assume equality and is not compromised if the amounts are not equal. Certainly there is no adjustment of read counts and no requirement for similar numbers of reads. All that is required is that there is enough RNA to start with, and that the amounts are within the operating characteristics of the protocol.

If you use spike-in, then the situation is different. Then the amount of spike-in needs to be carefully tied to the sample RNA volume or cell numbers somehow. Perhaps the RNA-seq protocol you are describing includes spike-ins, in which case the steps you describe might be part of that.

Nevertheless, I agree with your original suggestion that the usual edgeR normalization assumptions might not be suitable for studying miRNA in extracellular vesicles. MicroRNAs are hard to work with to start with, and extracellular miRNAs would be worse. I don't have any experience with that, and can't make any recommendations. RNA-seq is a fantastically powerful system, but the DE analyses are designed for situations where you have results for a sizeable part of the transcriptome and the changes are not global.

ADD REPLY • link 18 hours ago by Gordon Smyth ★ 8.3k