I don't have much experience with RNA-Seq, but I see that the data are usually published as FPKM values rather than raw counts. What is the reason for that? Is it only so that the values can be modeled with a log-Gaussian distribution rather than a discrete distribution such as Poisson or negative binomial? Or does it serve to make the data more accurate and reliable?
The reason for FPKM is mostly historical, as there are practically only disadvantages to distributing data this way.
- There are several posts and publications showing that FPKM is inferior to other units.
- FPKM is not directly compatible with most DE packages.
- Providing raw counts would instead allow anyone to compute the transformation they wanted (CPM, TPM, FPKM), while the FPKM transformation is not easily reversible.
- FPKM carries over biases and errors in the gene prediction, because it depends on the annotated feature length; it is especially unsuitable for draft genomes, where exons are often poorly annotated.
- FPKM values need to be stored as floating-point numbers, which introduces unnecessary rounding error and possibly larger files, whereas counts can be stored as integers.
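
To illustrate the point about reversibility, here is a minimal pure-Python sketch. The function names (`cpm`, `fpkm`, `tpm`) and the toy numbers are made up for illustration, and the library size is assumed to be the sum of the counts over the listed genes; real pipelines may use the total number of mapped fragments instead.

```python
def cpm(counts):
    """Counts per million: scale by library size only."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def fpkm(counts, lengths_bp):
    """Fragments per kilobase of feature per million fragments."""
    total = sum(counts)  # assumption: library size = sum over the listed genes
    return [c / (l / 1e3) / (total / 1e6) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Toy data (hypothetical): the same sample sequenced at two depths.
lengths = [1000, 2000, 500]
shallow = [10, 20, 70]
deep = [100, 200, 700]  # 10x more fragments, same proportions

fpkm_shallow = fpkm(shallow, lengths)
fpkm_deep = fpkm(deep, lengths)
tpm_deep = tpm(deep, lengths)
cpm_deep = cpm(deep)
```

With the raw counts in hand, all three units are one line each. Going the other way is not possible from FPKM alone: `fpkm_shallow` and `fpkm_deep` agree up to floating-point error, so the original counts and sequencing depth cannot be told apart without also knowing the library size and the feature lengths that were used.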
RPKM/FPKM and TPM normalize read counts by library size (the total number of reads in a given RNA-seq experiment) and by the length of the feature (gene/transcript). But remember that commonly used software for differential expression analysis (DESeq2, edgeR) takes raw counts as input rather than normalized values, because these tools perform their own internal normalization.
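
For reference, the usual definitions can be written as follows. The symbols are introduced here only for illustration: $c_i$ is the number of fragments assigned to feature $i$, $L_i$ is its length in base pairs, and $N$ is the total number of mapped fragments in the library.

$$
\mathrm{FPKM}_i = \frac{c_i \cdot 10^{9}}{N \cdot L_i},
\qquad
\mathrm{TPM}_i = \frac{c_i / L_i}{\sum_j c_j / L_j} \cdot 10^{6}
= \frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j} \cdot 10^{6}
$$

The factor $10^{9}$ combines the per-kilobase ($10^{3}$) and per-million ($10^{6}$) scalings. The last equality shows that TPM can be derived from FPKM, whereas recovering $c_i$ from either unit requires knowing $N$ and $L_i$.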