The question confuses several things, as hinted at by the different answers by dariober and karl.stamm.
There are two assumptions often stated in RNAseq, the first is that the majority of genes are not DE. The second is that an equal number of genes are up and down regulated, or more properly, that the mean log-fold change is zero.
The first of these assumptions is without doubt wrong (that doesn't mean we aren't justified in making it). For any random variable X distributed according to a continuous distribution, the probability of X taking any exact value is zero, so P(X=0) = 0. In ANY two conditions, if you had sufficient replicates, every gene would be statistically significantly DE, although the effect size would likely be too small to be interesting in most cases (interesting and significant are distinct concepts).
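A quick toy simulation makes this concrete (the numbers here are made up for illustration): with enough replicates, even a biologically trivial shift becomes statistically significant.

```r
# Toy illustration: a true mean difference of 0.01 (trivial effect size)
# is detected as "significant" once the sample size is absurdly large.
set.seed(1)
n <- 1e6                       # an unrealistic number of replicates
a <- rnorm(n, mean = 0)        # condition A
b <- rnorm(n, mean = 0.01)     # condition B: tiny, uninteresting shift
t.test(a, b)$p.value           # well below 0.05: significant but not interesting
```

The effect size is 0.01 standard deviations; no one would call that biologically meaningful, yet the test rejects the null.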
However, the second of these assumptions is far more important in RNAseq analysis.
The second thing that is being confused is the distinction between two questions: "Why is this necessary?" and "Why are we justified in doing it?"
@dariober's answer lays out the reasons why the assumptions are necessary, although he confuses assuming that most genes don't change with assuming that the mean change is zero in describing the solution. edgeR's normalisation method TMM (trimmed mean of M-values) does this literally: it finds the library normalisation factor that makes the trimmed mean log-fold change 0. DESeq works slightly differently, and the assumption is less direct, but it boils down to the same thing.
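The core of the TMM idea can be sketched in a few lines of base R. This is a simplified, hand-rolled version for intuition only; edgeR's actual implementation also trims on absolute expression (A-values) and applies precision weights.

```r
# Minimal sketch of the TMM idea (NOT edgeR's exact algorithm):
# compute per-gene log-fold changes (M-values) between an observed library
# and a reference library, then find the scaling factor that makes their
# trimmed mean zero.
tmm_sketch <- function(ref, obs, trim = 0.3) {
  keep <- ref > 0 & obs > 0                    # drop genes with zero counts
  M <- log2(obs[keep] / sum(obs[keep])) -
       log2(ref[keep] / sum(ref[keep]))        # M-values on library proportions
  2^mean(M, trim = trim)                       # factor that zeroes the trimmed mean
}

# With 11 unchanged genes and one 10x gene, the trimmed mean ignores the
# outlier and recovers the compositional shift:
tmm_sketch(ref = rep(100, 12), obs = c(rep(100, 11), 1000))
```

Because 30% is trimmed from each tail, the single DE gene is excluded and the factor is driven entirely by the unchanged majority, which is exactly why the "mean log-fold change is zero" assumption matters.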
@RamRS points out that "You already assume that most genes are not DE, which is the assumption OP is questioning," but this is the wrong way to look at the problem. Statistical tests have assumptions, not because those assumptions are necessarily true, but because the tests only work if they are. T-tests assume normality; that doesn't mean any data you run a t-test on is normal, but rather that if your data isn't normal, you can't do a t-test. Thus there is no biological reason to assume that the mean log-fold change is zero in a particular experiment. It's just that RNAseq can't possibly work if it isn't true of your particular experiment.
Consider: I have two conditions, A and B and 10 genes. The expression of every gene in condition B is twice that in condition A. Now I sequence 100 reads from each condition. In condition A, each gene gets 10 reads, and in condition B, where gene expression of every gene is twice as high ... each gene still gets 10 reads, because there are only 100 to go around. In this situation RNAseq is simply the wrong tool to understand the changes.
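The thought experiment above can be run directly. The numbers (10 genes, 100 reads) are the ones from the text:

```r
# Every gene doubles between A and B, but with a fixed sequencing budget
# the observed counts are identical in the two conditions.
true_A <- rep(10, 10)      # true expression of 10 genes in condition A
true_B <- 2 * true_A       # every gene doubled in condition B
reads  <- 100              # fixed sequencing depth per library

counts_A <- round(true_A / sum(true_A) * reads)
counts_B <- round(true_B / sum(true_B) * reads)

counts_A   # 10 reads per gene
counts_B   # also 10 reads per gene: the global doubling is invisible
```

Sequencing only observes proportions, and the proportions are unchanged, so no normalisation scheme can recover the global shift from these counts alone.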
The use of the other assumption (that most genes are not DE) is a little more esoteric. I believe that it is used in calculating the per-gene dispersion value, but I'd really need Mike Love or Gordon Smyth to confirm that.
So if there is no biological reason for these assumptions, why do we continue to make them? This comes down to the fact that we are doing a null hypothesis test. In a null hypothesis test, we assume the null hypothesis is true, and then calculate the likelihood of our data under that assumption. In almost all RNAseq experiments, we carry out 20,000 (or however many genes you have) tests of the null hypothesis that the log-fold change is 0. Thus when we conduct the test we assume the null hypothesis is true: that is how you do a null hypothesis test. Usually in science the null hypothesis should at least be plausible; in fact, we usually want the null hypothesis to be more or less certainly true if our scientific hypothesis is not. Again, we come back to the point that if it is not likely that the log-fold change is zero, then standard RNAseq, as we know it, is the wrong experiment to be doing.
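One consequence of testing ~20,000 null hypotheses is easy to demonstrate (the replicate count of 3 per group below is an assumption for illustration): when the null is true, p-values are uniform, so a fixed cutoff flags a predictable fraction of genes by chance alone.

```r
# Under a true null (logFC = 0), p-values are uniformly distributed, so
# roughly 5% of 20,000 genes pass p < 0.05 purely by chance -- which is
# why multiple-testing correction (FDR) is standard in RNAseq pipelines.
set.seed(42)
pvals <- replicate(20000, t.test(rnorm(3), rnorm(3), var.equal = TRUE)$p.value)
mean(pvals < 0.05)   # close to 0.05, i.e. ~1000 false positives
```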
I'm not sure I agree or understand karl.stamm's answer - anyway, here's my take on the question.
With RNAseq (bulk or single-cell, and likewise microarrays) you measure the relative concentration of transcripts, not their absolute number. This means that if a gene is more highly expressed in condition A than in B (= higher concentration), the other genes in condition A will appear to have lower concentration relative to B. You can correct for this by assuming that most genes do not change. Here's an example.
Imagine this is the real number of transcripts in 2 pairs of samples (rows are genes columns are samples):
real <- read.table(text="a1,a2,b1,b2
100,100,100,100
100,100,100,100
100,100,100,100
100,100,100,100
100,100,100,100
100,100,100,100
100,100,100,100
100,100,100,100
100,100,100,100
100,100,100,100
100,100,100,100
1000,1000,100,100", sep= ',', header= TRUE)
You see that genes 1 to 11 do not change while gene 12 goes up by 10x in A vs B:
real
     a1   a2  b1  b2
1   100  100 100 100
2   100  100 100 100
3   100  100 100 100
4   100  100 100 100
5   100  100 100 100
6   100  100 100 100
7   100  100 100 100
8   100  100 100 100
9   100  100 100 100
10  100  100 100 100
11  100  100 100 100
12 1000 1000 100 100
If you sequence these libraries at the same depth of 100000 reads/library, this is what you would see in the counts table:
counts <- round(t(t(real) / colSums(real)) * 100000)
counts
         a1    a2   b1   b2
 [1,]  4762  4762 8333 8333
 [2,]  4762  4762 8333 8333
 [3,]  4762  4762 8333 8333
 [4,]  4762  4762 8333 8333
 [5,]  4762  4762 8333 8333
 [6,]  4762  4762 8333 8333
 [7,]  4762  4762 8333 8333
 [8,]  4762  4762 8333 8333
 [9,]  4762  4762 8333 8333
[10,]  4762  4762 8333 8333
[11,]  4762  4762 8333 8333
[12,] 47619 47619 8333 8333
Note that genes 1-11 appear to have lower expression in A vs B.
This is what happens if you do not assume that most genes stay the same and you only normalize for library size (which is not even necessary here):
library(edgeR)

counts <- counts + rpois(n= nrow(counts) * ncol(counts), lambda= 20) # Some random noise
y <- DGEList(counts= counts, group= c('A', 'A', 'B', 'B'))
y <- calcNormFactors(y, method= 'none')
y <- estimateDisp(y)
et <- exactTest(y)
et$table
        logFC   logCPM        PValue
1   0.8040710 16.00099 2.887342e-269
2   0.8009565 16.00127 2.949170e-267
3   0.8022661 16.00105 4.338599e-268
4   0.8014513 16.00143 1.341432e-267
5   0.8002031 16.00094 9.970869e-267
6   0.8042200 16.00034 2.733762e-269
7   0.8032335 16.00121 9.661878e-269
8   0.8019855 16.00072 7.207432e-268
9   0.8024803 16.00088 3.277285e-268
10  0.8024539 15.99952 4.859277e-268
11  0.8017492 16.00072 1.028928e-267
12 -2.5084330 18.08930  0.000000e+00
It looks like everything is differentially expressed with genes 1-11 going up and gene 12 going down (but not as much as it should).
With the assumption and TMM normalization, things look as they should: genes 1-11 do not change:
y <- calcNormFactors(y, method= 'TMM')
y <- estimateDisp(y)
et <- exactTest(y)
et$table
           logFC   logCPM    PValue
1   1.880650e-03 15.94587 0.9402800
2  -1.233916e-03 15.94630 0.9615296
3   7.571292e-05 15.94602 1.0000000
4  -7.390485e-04 15.94644 0.9783870
5  -1.987294e-03 15.94601 0.9359110
6   2.029644e-03 15.94521 0.9352271
7   1.043084e-03 15.94613 0.9687670
8  -2.047493e-04 15.94571 0.9965996
9   2.899334e-04 15.94584 0.9944249
10  2.635540e-04 15.94448 0.9953280
11 -4.410393e-04 15.94572 0.9885521
12 -3.310633e+00 18.31484 0.0000000
The assumption comes from the experimental design: we're testing similar samples with just the experimental factor changed. It should be the same kind of mice, with or without a drug treatment, or the same kind of corn with or without extra irrigation.
Consider the converse: if you were to run RNA-seq and differential expression analysis on a sample of corn vs a sample of human skin, where of course every gene is expected to be D.E., then the experiment is pointless and you don't need statistical tooling.
What would it mean NOT to assume that most genes are not D.E.? The experimenter would be expecting most genes to be D.E., and would get back a spreadsheet listing every gene. That is not a good use of resources. So we (software writers) assume the user (experimental designer) has a controlled situation with minimal differences between the case and control samples.
There is already at least one good answer to this question, and I will offer a different way of looking at this issue.
Most genes are not differentially expressed because the basic functions of life are the same regardless of developmental stage or environmental challenge. This is the concept of house-keeping genes. Let's say that cells are challenged by some toxic compound. In most instances that will not change anything about mitochondrial structure or protein composition, unless the compound specifically targets mitochondrial enzymes. The same will likely be true for most cell membrane constituents, even if the cells over-express one or several transporter proteins to pump the compound out.

The point is that most of life's functions are not affected by small to moderate environmental changes, so most genes remain at the same level. Think of driving a car: most things a car does are the same whether one is driving on a sunny day, in the rain, or at night. We have to turn the headlights on in the dark (turn on a group of "genes" to help deal with less ambient light) or start the windshield wipers in the rain (turn on the "genes" that help get rid of excess water). Yet there will be very little difference in how the car burns the fuel, how that energy from the fuel powers the wheels, or how it supplies electricity for the lights, the AC, or the music system.