I have the impression that FPKM as calculated by cufflinks doesn't correctly normalize for library size when mapping rates differ between samples (e.g. due to contamination). I get this impression from real data.
As a hypothetical example, say we have two libraries A and B, where B has 10 times the number of reads of A. One might expect (I know it's a rough estimate) a large number of genes in B to have 10 times the read count they have in A, but mostly equal FPKM, except for lowly expressed transcripts. Now assume A has a ~100% mapping rate and B only 50% (due to contamination; 40% would align to other references, if they were known). Many transcripts in B still seem to have ~10x as many reads mapped as in A (perhaps because the number of mapped reads is not influenced linearly by contamination?), but the aligned library size that cufflinks sees is only 50% of the true library size. That would lead to roughly a 2-fold difference in the median of all FPKM values.
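To make the 2-fold shift concrete, here is a back-of-the-envelope calculation in Python. All numbers are hypothetical, chosen to match the scenario above:

```python
# Hypothetical numbers matching the scenario above (assumptions, not real data).
reads_A, reads_B = 10e6, 100e6          # sequenced reads: B = 10x A
mapped_A = reads_A * 1.00               # A maps ~100%
mapped_B = reads_B * 0.50               # B maps only 50% (contamination)

# A 1 kb transcript whose counts scale with sequenced depth (10x more in B):
length_kb = 1.0
count_A, count_B = 100, 1000

# FPKM = reads / (transcript length in kb * mapped reads in millions)
fpkm_A = count_A / (length_kb * mapped_A / 1e6)   # -> 10.0
fpkm_B = count_B / (length_kb * mapped_B / 1e6)   # -> 20.0, i.e. 2x inflated
```

Because the denominator uses mapped rather than sequenced reads, B's FPKM comes out twice as high as A's even though the underlying expression is the same.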
Reference size is not taken into account for FPKM, is it? If I knew all contaminant genomes beforehand, I could add them to the reference and achieve a much more homogeneous mapping rate and better normalization between samples. On the other hand, this might not be a fair comparison, since the contaminants would never get an FPKM assigned and thus would not influence the calculated average FPKM.
Is this correct in your experience, and how is it best compensated for:
- Add all possible contaminants to reference before alignment?
- Use the sequenced library size instead of aligned library size?
- Assume an equal mapping rate for all samples (e.g. the median; this would be similar to median scaling of FPKM values)?
- Apply quantile normalization to FPKM data?
- FPKM is already normalized; don't apply a double normalization.
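For reference, the quantile-normalization option from the list above is straightforward to sketch in Python/NumPy. The matrix here is made up; in practice rows would be genes and columns samples:

```python
import numpy as np

def quantile_normalize(mat):
    """Force every column (sample) of a genes x samples matrix onto the
    same distribution: each value is replaced by the mean of the values
    at the same rank across all samples (ties broken arbitrarily)."""
    ranks = np.argsort(np.argsort(mat, axis=0), axis=0)
    rank_means = np.sort(mat, axis=0).mean(axis=1)
    return rank_means[ranks]

# Made-up FPKM-like matrix: 4 genes x 3 samples
fpkm = np.array([[5., 4., 3.],
                 [2., 1., 4.],
                 [3., 4., 6.],
                 [4., 2., 8.]])
normalized = quantile_normalize(fpkm)
# After normalization, every sample contains exactly the same set of values.
```

This is the bluntest of the options: it forces identical distributions regardless of whether the differences were biological or technical.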
What's the goal? Are you trying to get some sort of contaminant-normalized FPKM, such that equal expression between samples produces the same number even when the samples have differing levels of contamination?
BTW, no, reference size isn't involved in the FPKM calculation. Depending on your goals, you might find something like the way edgeR calculates FPKM useful (i.e., it performs the normal between-sample library normalization and uses the resulting effective library size in the FPKM calculation, rather than the total mapped reads). Some variant of that might serve your needs better.
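A rough Python sketch of that idea (the normalization factors here are caller-supplied placeholders; edgeR computes them with TMM, and its rpkm() function combines them with the library sizes internally):

```python
import numpy as np

def fpkm_with_effective_libsize(counts, lengths_kb, norm_factors):
    """FPKM using an effective library size (mapped reads x per-sample
    normalization factor) as the denominator, in the spirit of edgeR.
    counts: genes x samples matrix; lengths_kb: per-gene length in kb;
    norm_factors: per-sample factors (placeholders here, TMM in edgeR)."""
    eff_lib = counts.sum(axis=0) * norm_factors      # effective library sizes
    return counts / lengths_kb[:, None] / (eff_lib / 1e6)[None, :]

# Toy data: 2 genes x 2 samples, neutral normalization factors
counts = np.array([[100., 1000.],
                   [50.,  400.]])
lengths_kb = np.array([1., 2.])
fpkm = fpkm_with_effective_libsize(counts, lengths_kb, np.ones(2))
```

With factors below 1 for heavily contaminated samples, the inflated denominator effect described in the question would be counteracted.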
Hi, the goal is to get expression values that are comparable between libraries, i.e. independent of contamination/mapping rate, for clustering.
I am switching to log2-transformed, TMM-normalized CPM values (edgeR). The distributions and quantiles of each library look much more homogeneous and "normal-like" than the RPKM values did.
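For completeness, a simplified log2-CPM computation in Python. Note that edgeR's cpm(..., log=TRUE) additionally scales the prior count with library size, so this flat-pseudocount sketch will differ slightly from edgeR's output:

```python
import numpy as np

def log2_cpm(counts, prior=0.5):
    """Simplified log2 counts-per-million for a genes x samples matrix.
    A flat pseudocount avoids log(0); edgeR scales the prior count with
    library size instead, so values differ slightly from edgeR's."""
    lib_sizes = counts.sum(axis=0).astype(float)
    return np.log2((counts + prior) / lib_sizes[None, :] * 1e6)

# Toy data: 2 genes x 2 samples (library sizes 100 and 1000)
counts = np.array([[0., 10.],
                   [100., 990.]])
logcpm = log2_cpm(counts)
```

The log transform also tames the heavy right tail of count data, which helps distance-based clustering.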