I have the impression that FPKM as calculated by cufflinks doesn't correctly normalize for library size when mapping rates differ between samples (e.g. due to contamination). I get this impression from real data.
As a hypothetical example, say we have two libraries A and B, where B has 10 times as many reads as A. One might expect (I know it's a rough estimate) a large number of genes in B to have 10 times the read count of A, but mostly equal FPKM, except for lowly expressed transcripts. Now assume A has a ~100% mapping rate, but B only 50% due to contamination (40% of B's reads would align to other references, if those were known). Many transcripts in B still seem to have 10x as many reads mapped as in A (maybe because the number of mapped reads is not affected linearly by contamination?), but the effective aligned library size that cufflinks sees is only 50% of the true library size. That would inflate every FPKM in B by roughly 2-fold, shifting the median of all FPKM values by the same factor.
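To make the 2-fold effect concrete, here is a toy calculation of the standard FPKM formula (this is not cufflinks' internal code; the numbers are made up for illustration). The only thing that changes between the two calls is which library size is used as the denominator:

```python
def fpkm(fragments, transcript_len_bp, library_size):
    """FPKM = fragments * 1e9 / (library_size * transcript_len_bp)."""
    return fragments * 1e9 / (library_size * transcript_len_bp)

# Hypothetical sample B: 1,000,000 sequenced fragments, only 50% map
# due to contamination.
fragments = 200          # fragments assigned to a 2 kb transcript
length = 2000            # transcript length in bp
sequenced = 1_000_000    # true sequenced library size
aligned = 500_000        # aligned library size (what the aligner reports)

print(fpkm(fragments, length, aligned))    # -> 200.0 (normalized by aligned reads)
print(fpkm(fragments, length, sequenced))  # -> 100.0 (normalized by sequenced reads)
# The first value is exactly 2x the second: the contamination-driven drop in
# aligned library size inflates every FPKM in the sample by the same factor.
```

Since the denominator is a per-sample constant, the bias is multiplicative and hits all transcripts of a sample equally, which is why it shows up as a shift in the FPKM median.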
Reference size is not taken into account for FPKM, is it? If I knew all contaminant genomes beforehand, I could just add them to the reference and achieve a much more homogeneous mapping rate and better normalization between samples. On the other hand, this might not be a fair comparison, as the contaminants would never have a chance of getting an FPKM assigned and would not influence the calculated average FPKM.
Is this correct in your experience, and how is it best compensated for?
- Add all possible contaminant genomes to the reference before alignment?
- Use the sequenced library size instead of the aligned library size?
- Assume an equal mapping rate for all samples (e.g. the median rate; this would amount to median scaling of the FPKM values)?
- Apply quantile normalization to the FPKM matrix?
- Do nothing: FPKM is already normalized, so don't apply a second normalization?