Inexplicable Cufflinks output FPKM variation
9.3 years ago
Tobias ▴ 150

When using the cuffnorm program from the Cufflinks suite with a GTF file (say, x.gtf) and three BAM files (1.bam, 2.bam and 3.bam), we can call

cuffnorm x.gtf 1.bam 2.bam

and

cuffnorm x.gtf 1.bam 3.bam

When we consider the gene FPKM values, we obtain for the two calls outputs like

tracking_id q1_0 q2_0
ENSG00000000003.10 2.59667 32.8815
ENSG00000000005.5 0 0
ENSG00000000419.8 68.1701 76.2395
...

and

tracking_id q1_0 q2_0
ENSG00000000003.10 2.76372 14.1348
ENSG00000000005.5 0 0
ENSG00000000419.8 72.5559 38.017
...

Of course, the last gene expression column comes from a different BAM file in each call, which explains its variation. However, the first gene expression column always corresponds to 1.bam, so in principle we would expect it to be identical in both outputs. Instead, we see some variation.
Our questions are now: Why is that so? Is there some way to bypass that?

Many thanks for your help in advance!

rna-seq cufflnks cuffnorm • 3.0k views
9.3 years ago

This is a feature, not a bug.

You should not expect the same values for a single sample when you perform normalization with other samples. In fact, that would defeat the whole purpose of normalization. The purpose of cuffnorm is to produce normalized expression values that can directly be used for statistics. To do that, it has to normalize the library sizes of the samples to each other. Since this process will necessarily incorporate all of the samples input, the output will necessarily vary with changes in the input samples.
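To make the dependence on the other samples concrete, here is a minimal sketch of a median-of-ratios ("geometric") scaling factor, which to my knowledge is the idea behind cuffnorm's default library normalization. The function name `size_factors` and the toy counts are made up for illustration, and this is not cuffnorm's actual code.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios scaling factors (illustrative sketch only).

    counts: dict mapping sample name -> list of per-gene fragment counts,
            with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Use only genes with a nonzero count in every sample.
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    # Log of the per-gene geometric mean across all input samples.
    log_geo = {g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
               for g in usable}
    # Each sample's factor is the median ratio to that geometric mean.
    return {s: math.exp(median(math.log(counts[s][g]) - log_geo[g]
                               for g in usable))
            for s in samples}

# Hypothetical toy counts: s1 is the same data in both runs.
s1 = [100, 200, 300, 400]
s2 = [200, 400, 600, 800]
s3 = [50, 100, 150, 200]

f12 = size_factors({"s1": s1, "s2": s2})
f13 = size_factors({"s1": s1, "s3": s3})
print(f12["s1"], f13["s1"])  # the factor for s1 differs between the two runs
```

Because the reference (the per-gene geometric mean) is built from every input sample, swapping 2.bam for 3.bam changes the factor applied to 1.bam, and with it the normalized values reported for 1.bam.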


Hi Devon, many thanks for your comment.

Is there a way to obtain "absolute" FPKM values for each RNA-seq BAM file? If necessary, we could use a program other than Cufflinks, though from all I have heard Cufflinks is pretty good for such purposes.


You could just change the library normalization method to "classic-fpkm". Keep in mind that you then can't directly use the values for statistics.
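For reference, a minimal sketch of the textbook FPKM formula, which depends only on the sample itself. The function name `fpkm` and the numbers are hypothetical. Cufflinks estimates abundances with its own statistical model and effective lengths, so its "classic-fpkm" output will not match this simple arithmetic exactly.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragments: fragments assigned to the gene or transcript
    length_bp: feature length in base pairs
    total_mapped_fragments: all mapped fragments in this one sample
    """
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (length_bp * total_mapped_fragments)

# Hypothetical example: 500 fragments on a 2 kb gene, 20 million mapped fragments.
value = fpkm(500, 2000, 20_000_000)
print(value)  # 12.5
```

Nothing in this calculation refers to any other sample, which is why values of this kind stay the same regardless of which BAM files are processed together. On the command line the method is selected with cuffnorm's `--library-norm-method classic-fpkm` option.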


Hi Devon, what is meant here by statistics?


Most people doing RNAseq want to look at differential expression and things like that. You can't reliably do those things (i.e., perform any comparative statistics) on raw FPKMs.


That's very interesting. I have to analyze roughly 1000 RNA-seq datasets, and each BAM file is huge (median 20 GB). How would you analyze all these files?


It depends on the organism.


It is human, with hg19 annotation. Or would you suggest entirely different software for gene expression analysis? In the end, after processing these data, I want to compare them to already processed FPKM data from the TCGA consortium.


If you intend to do the typical differential expression analysis, then just run featureCounts on the BAM files. You'll get the raw counts needed for downstream statistics from that.
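A rough sketch of how such a featureCounts call could be assembled for many BAM files. The helper `build_featurecounts_cmd`, the output name counts.txt and the thread count are assumptions for illustration. Only the basic `-T`, `-a` and `-o` options are used, and strandedness or paired-end options would have to be added to match the actual library.

```python
import shlex

def build_featurecounts_cmd(gtf, bams, out="counts.txt", threads=4):
    """Assemble a featureCounts argument list (hypothetical helper).

    -T: number of threads, -a: annotation file, -o: output count table.
    """
    return ["featureCounts", "-T", str(threads), "-a", gtf, "-o", out, *bams]

# File names taken from the question; adapt to the real paths.
cmd = build_featurecounts_cmd("x.gtf", ["1.bam", "2.bam", "3.bam"])
print(shlex.join(cmd))
# To actually run it (requires the Subread package to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

With around 1000 large BAM files it may be more practical to run batches of files separately and merge the resulting count columns afterwards, since raw counts for one sample do not depend on which other samples are in the same run.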
