"(not looking for DE genes, just expression levels)"
Actually, getting DE genes is much easier than getting expression levels. The relation between counts per gene and the actual amount of mRNA in the sample is not trivial. When you compare the counts per gene between two samples, you assume that the change you see in the counts represents the change in mRNA, because the biases should be the same in both. Try taking the same sample and preparing RNA-seq libraries using two different protocols (we did it): you'll see that some genes (or a lot of genes) have different counts in the two libraries, and this difference is reproducible. This means that counts don't truly represent the mRNA levels.
Following the comments, I'll add some more thoughts. I came across a similar situation. Luckily for me, the total number of reads in the two libraries was about the same, so I could just compare the raw counts between the libraries. If the two libraries look very much the same (you can plot the base-by-base correlation), then dividing each count by the total number of mapped counts in the library is not a bad idea.
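To make the library-size idea concrete, here is a minimal sketch in plain Python. The function names (`scale_by_library_size`, `spearman`) and the toy numbers are my own and only for illustration; a real analysis would normally use an established statistics package. It scales each count by the library total and computes a Spearman correlation as a rough check of how similar the two libraries are.

```python
from statistics import mean


def scale_by_library_size(counts):
    """Divide each count by the total number of mapped counts in the library."""
    total = sum(counts)
    if total == 0:
        raise ValueError("library has no mapped counts")
    return [c / total for c in counts]


def _average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = _average_ranks(x), _average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy per-position counts for two libraries (made-up numbers).
lib_x = [10, 0, 35, 120, 7]
lib_y = [12, 1, 30, 110, 9]

rho = spearman(lib_x, lib_y)
x_scaled = scale_by_library_size(lib_x)
y_scaled = scale_by_library_size(lib_y)
```

If `rho` is high, comparing `x_scaled` with `y_scaled` is reasonable. Note that the sketch does not handle the degenerate case where all values in a library are identical (the variance of the ranks would be zero).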
If the libraries are different (Spearman correlation below 0.9 or 0.85), I wouldn't divide by the total number of reads and would prefer the DESeq normalization method. In a nutshell, you assume that most of the positions in the genome have the same expression level in the two libraries. For each position i you compute the quotient Xi/Yi and take the median of these quotients as your normalization factor, then multiply Yj by that median for each position j in the genome. You are now able to compare Xj with the normalized Yj.
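The median-of-quotients step described above can be sketched like this. This is a simplified two-library version of the idea, not DESeq's actual implementation. The function names and numbers are hypothetical, and skipping positions where either library has a zero count is my own assumption (the quotient is undefined or uninformative there).

```python
from statistics import median


def median_ratio_factor(x, y):
    """Median of Xi/Yi over positions where both counts are positive.

    Assumption: positions with a zero count in either library are skipped.
    """
    ratios = [xi / yi for xi, yi in zip(x, y) if xi > 0 and yi > 0]
    if not ratios:
        raise ValueError("no positions with positive counts in both libraries")
    return median(ratios)


def normalize_to(x, y):
    """Rescale library y so that it is comparable with library x."""
    factor = median_ratio_factor(x, y)
    return [yj * factor for yj in y]


# Toy example: library X is sequenced about twice as deep as Y,
# and the last position is genuinely more highly expressed in X.
X = [100, 200, 300, 4000]
Y = [50, 100, 150, 200]

factor = median_ratio_factor(X, Y)   # quotients are 2, 2, 2, 20 -> median 2.0
Y_norm = normalize_to(X, Y)          # [100.0, 200.0, 300.0, 400.0]
```

Dividing by the library totals here would give a factor of 4600/500 = 9.2, driven almost entirely by the single outlier position. The median is robust to that, which is why it behaves better when the libraries differ.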
modified 5.9 years ago
5.9 years ago by
Asaf ♦ 8.4k