I am trying to calculate expression values that are comparable within one sample (so I can compare gene A, gene B, etc. in the same sample) and across samples (so I can compare gene A in sample 1 with gene A in sample 2, and so on). My samples are from single-cell RNA sequencing, and I've been using the Cufflinks suite to estimate abundance. I have a question about the FPKM values generated by cufflinks and cuffnorm.
First I tried estimating abundance using only cufflinks and checking the resulting fpkm_tracking table. I did this for a small subset of my samples (2 samples, with library sizes of about 30 million and 20 million reads respectively). Curious, I then also estimated abundance with cuffnorm, using the same 2 samples. For each sample, I plotted the cuffnorm FPKM values (on the X axis) against the cufflinks FPKM values (on the Y axis). I was surprised to find that for both samples, most of the cuffnorm FPKM values were lower than the corresponding cufflinks values.
Here's an example from one gene:
- with cufflinks, sample A: 222.349
- with cuffnorm, sample A: 31.376
- with cufflinks, sample B: 122.469
- with cuffnorm, sample B: 17.333
I see this in several other genes as well, but not all. Here's one that barely changes:
- with cufflinks, sample A: 29.4418
- with cuffnorm, sample A: 26.976
- with cufflinks, sample B: 8.0248
- with cuffnorm, sample B: 8.649
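To put a number on the difference, here is a minimal sketch that computes the cufflinks-to-cuffnorm ratio for the values quoted above. The labels `gene 1` and `gene 2` and the helper `fold_change` are hypothetical names used only for illustration; the FPKM values themselves are the ones listed in this post.

```python
# FPKM values quoted in this post, stored as (cufflinks, cuffnorm) per sample.
# "gene 1" and "gene 2" are placeholder labels, not real gene names.
fpkm = {
    "gene 1": {"A": (222.349, 31.376), "B": (122.469, 17.333)},
    "gene 2": {"A": (29.4418, 26.976), "B": (8.0248, 8.649)},
}


def fold_change(cufflinks_fpkm, cuffnorm_fpkm):
    """Return how many times larger the cufflinks value is than the cuffnorm value."""
    return cufflinks_fpkm / cuffnorm_fpkm


# Ratio for every gene/sample pair.
ratios = {
    gene: {sample: fold_change(*pair) for sample, pair in samples.items()}
    for gene, samples in fpkm.items()
}

for gene, samples in ratios.items():
    for sample, ratio in samples.items():
        print(f"{gene}, sample {sample}: cufflinks / cuffnorm = {ratio:.2f}")
```

For the first gene the ratio is about 7.09 in sample A and 7.07 in sample B, while for the second gene it is about 1.09 and 0.93, so the scaling is clearly not one constant factor applied to every gene.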
Looking at the mean and median of FPKMs, these are the numbers:
- cufflinks, sample A mean & median: 23.59 & 0.1073
- cuffnorm, sample A mean & median: 17.4 & 0.04278
- cufflinks, sample B mean & median: 24.2 & 0.01044
- cuffnorm, sample B mean & median: 20.94 & 0 (yes, zero)
From these numbers, there appears to be a global downscaling of the FPKM values.
I also calculated the Spearman correlation coefficient between the cufflinks and cuffnorm FPKM values and got 0.9722 for sample A and 0.968 for sample B, which I understand to mean that the two sets of values rank the genes very similarly and are positively correlated (the P-values are reported as 0 for both samples).
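For reference, this is roughly what the Spearman calculation does, written as a minimal standard-library sketch (the function names `average_ranks`, `pearson`, and `spearman` are my own, and the input values are toy numbers, not my data). It also shows why a high Spearman coefficient says nothing about scale: dividing every value by the same factor leaves the ranks unchanged.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values only: the second list is the first divided by a constant.
toy = [10.0, 20.0, 30.0, 40.0]
scaled = [v / 7 for v in toy]
rho = spearman(toy, scaled)
print(rho)
```

Because the ranks are identical, the coefficient for the toy lists is 1 even though every value differs by a factor of 7, so the correlation alone cannot explain or rule out the downscaling.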
My questions are:
- What's happening here? Why are most of my expression values from cuffnorm scaled down relative to cufflinks?
- I may be stretching cufflinks / cuffnorm for my analysis, but my goal here is to use a unit of expression abundance comparable across samples and across genes. In lieu of the actual relative molar concentration, what do you recommend using?
I'm on version 2.2.1 (4237) of the Cufflinks suite, running on 64-bit Ubuntu 12.04, by the way.
Thanks in advance :).