Several biases come into play in an RNA-seq experiment:
- gene length
- gene GC composition
- library size
- RNA population composition for each condition
Two of these biases can be ignored if you only compare the same gene across conditions, because they are inherent to the gene:
- gene length
- gene GC composition
Gene length: the raw counts of two genes cannot be compared directly if gene A is twice as long as gene B. Because of its length, the longer gene has a higher chance of being sequenced than the shorter one. In the end, for the same expression level, the longer gene gets more reads than the shorter one (pub1).
Gene GC composition: I did not fully understand the explanation for this bias. For two genes with different GC content, the one whose GC content is closest to 40% will be sequenced more than the other (pub2).
The other biases are "technical biases", due to your samples and sequencing method.
Library size: the best-known bias. You create two libraries for two conditions with the same RNA composition. The second library works much better than the first: you get 12 000 000 reads for condition A and 36 000 000 reads for condition B. You will have three times (36 000 000 / 12 000 000 = 3) more reads for each RNA in condition B than in condition A (pub3).
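To make the library size effect concrete, here is a minimal sketch of a total-count style scaling (counts per million). The library sizes come from the example above; the raw counts for the single gene (120 and 360) are hypothetical numbers chosen only for illustration.

```python
# Library sizes taken from the example above.
library_size = {"A": 12_000_000, "B": 36_000_000}

# Hypothetical raw counts for one gene whose true expression is identical
# in both conditions (values chosen for illustration only).
raw_count = {"A": 120, "B": 360}


def counts_per_million(count, total_reads):
    """Scale a raw count by the library size, expressed per million reads."""
    return count * 1_000_000 / total_reads


cpm = {cond: counts_per_million(raw_count[cond], library_size[cond])
       for cond in raw_count}

# Raw counts suggest a 3-fold difference that is only due to sequencing depth.
raw_fold = raw_count["B"] / raw_count["A"]
print("raw fold change B/A:", raw_fold)
# After scaling, both conditions give the same value.
print("counts per million:", cpm)
```

This only corrects for sequencing depth; it does nothing about the composition problem described next.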
RNA population composition for each condition: this one is trickier. Say you again have two conditions, A and B. For each condition, you want to study 4 genes and you sequence 90 reads per condition.
Biologically, in condition A, 3 genes (Gene 1, Gene 2 and Gene 3) are expressed at the same level, 2 in arbitrary units, and Gene 4 is expressed at 24, which is 12 times more than each of the other three. In condition B, the same 3 genes are again expressed at 2, but Gene 4 is not expressed at all.
In your design, you get 90 reads for each condition (A and B). Reads are spread out according to expression level. So in condition A, Gene 1, Gene 2 and Gene 3 get 6 reads each and Gene 4 gets 72, 12 times more (72/6 = 12). The funny thing is that in condition B you also have 90 reads to spread, but this time Gene 4 is not expressed. The 90 reads are spread over the three remaining genes, 30 reads each.
You know that the true expression level of Gene 1 is the same in conditions A and B. Yet, judging from the read counts, Gene 1 appears 5 times less expressed in condition A than in condition B (6 reads versus 30), a bias caused by the absence of Gene 4.
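The arithmetic of this example can be checked with a short sketch. It assumes, as the example does, that reads are distributed strictly in proportion to expression level; the function name `allocate_reads` is just an illustrative helper.

```python
# Expression levels (arbitrary units) taken from the example above.
expression = {
    "A": {"Gene1": 2, "Gene2": 2, "Gene3": 2, "Gene4": 24},
    "B": {"Gene1": 2, "Gene2": 2, "Gene3": 2, "Gene4": 0},
}
READS_PER_CONDITION = 90


def allocate_reads(levels, total_reads):
    """Spread total_reads over genes in proportion to their expression level."""
    total_expression = sum(levels.values())
    return {gene: total_reads * level / total_expression
            for gene, level in levels.items()}


reads = {cond: allocate_reads(levels, READS_PER_CONDITION)
         for cond, levels in expression.items()}

for cond, counts in reads.items():
    print(cond, counts)

# Gene1 has the same true expression in A and B, yet its read count differs.
gene1_fold = reads["B"]["Gene1"] / reads["A"]["Gene1"]
print("apparent Gene1 fold change B/A:", gene1_fold)
```

Both libraries have exactly the same size here, so scaling by total count cannot remove this bias.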
To reduce these biases, there are many methods for normalizing RNA-seq data.
Those I call the naive ones:
- Total count
- Upper Quartile
- RPKM (Reads Per Kilobase per Million mapped reads, which is not robust enough for cross-condition experiments, pub4 & pub5)
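As an illustration of the last item, here is a minimal sketch of the usual RPKM formula, reads × 10^9 / (gene length in bp × total mapped reads). The gene lengths, read counts and library size below are hypothetical values; they mirror the gene length bias described earlier, with gene A twice as long as gene B.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical library and genes (values chosen for illustration only).
TOTAL_MAPPED_READS = 10_000_000
genes = {
    "geneA": {"length_bp": 2000, "reads": 200},  # twice as long, twice the reads
    "geneB": {"length_bp": 1000, "reads": 100},
}

rpkm_values = {
    name: rpkm(g["reads"], g["length_bp"], TOTAL_MAPPED_READS)
    for name, g in genes.items()
}
# Both genes end up with the same RPKM once length is accounted for.
print(rpkm_values)
```

RPKM still divides by the total number of mapped reads, so it inherits the composition problem from the 4-gene example.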
Those with a stronger statistical basis:
For the library size and RNA composition biases:
- RLE method (Relative Log Expression), as in DESeq2
- TMM method (Trimmed Mean of M-values), as in edgeR
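To give an idea of how this family of methods handles the 4-gene example, here is a simplified sketch of the median-of-ratios idea behind RLE. It is not the DESeq2 implementation, only an illustration under the assumption that genes with a zero count in any sample are left out of the size factor estimate; the function name is hypothetical.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample.

    counts: dict mapping sample name -> list of raw counts (same gene order).
    For each gene, the reference is the geometric mean across samples;
    the size factor of a sample is the median of its count/reference ratios.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Assumption: only genes with a non-zero count in every sample are used.
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    log_reference = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in usable
    }
    return {
        s: math.exp(median([math.log(counts[s][g]) - log_reference[g]
                            for g in usable]))
        for s in samples
    }


# Read counts from the 4-gene example (Gene 1 to Gene 4).
counts = {"A": [6, 6, 6, 72], "B": [30, 30, 30, 0]}
size_factors = median_of_ratios_size_factors(counts)
normalized = {s: [c / size_factors[s] for c in counts[s]] for s in counts}
print(size_factors)
print(normalized)
```

With these size factors, Gene 1, Gene 2 and Gene 3 get the same normalized value in both conditions, and only Gene 4 stands out.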
Plus, the most widely used statistical model for gene counts:
- negative binomial distribution (edgeR, DESeq2)
Add to that a multiple testing correction, to report only the genes that are robustly differentially expressed (DESeq2).
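For the multiple testing correction, one widely used procedure is Benjamini-Hochberg, which controls the false discovery rate. The sketch below is a plain Python illustration of that procedure, not the code of any particular package, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Gene indices sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes (illustration only).
p_values = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(p_values)
print(adjusted)
```

Genes are then called differentially expressed when their adjusted p-value falls below a chosen threshold.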
I would say that, for the same amount of money, you are better off with more replicates than with deeper gene coverage.
This is my naive understanding of the subject; feel free to correct anything I said here.
Useful links (one is in French, sorry): pub6, pub7
Other links : pub8, pub9, pub10, pub11, pub12