Question

Cufflinks/Cuffdiff giving incorrect statistics between ON/OFF genes

0

Entering edit mode

8.8 years ago

rodrigo.nicodemos82 ▴ 10

Hello all,

I've been using the Tuxedo suite to calculate differential gene expression for Drosophila RNAseq samples I have been analyzing. I've noticed that Cufflinks/Cuffdiff seems to give incorrect statistics between ON and OFF genes.

My pipeline for analyzing differential gene expression was: 1) Align reads with TopHat, 2) Cufflinks to estimate gene/isoform abundance, 3) Merge cufflinks files with Cuffmerge, and 4) Perform differential gene expression test with Cuffdiff (earlier version of cufflinks package).

For both cufflinks and cuffdiff I used the options -frag-bias-correct and -multi-read-correct to improve the accuracy of the abundance of the expressed genes.

Here is an example of how this pipeline seems to miscalculating the expression between ON and OFF genes:

test_id       gene_id       gene      locus                     sample_1   sample_2   status   value_1    value_2   log2(fold_change)   test_stat   p_value    q_value      significant
XLOC_000812   XLOC_000812   CR44143   chr2L:11721285-11722098   WT         mut        OK       0.721719   0         #NAME?              nan         5.00E-05   0.00148681   yes
XLOC_000933   XLOC_000933   CG31813   chr2L:13736139-13736792   WT         mut        OK       1.83633    0         #NAME?              nan         5.00E-05   0.00148681   yes
XLOC_001662   XLOC_001662   CG43165   chr2L:3694462-3694938     WT         mut        OK       1.07611    0         #NAME?              nan         5.00E-05   0.00148681   yes
XLOC_010147   XLOC_010147   CG14244   chr3R:22551862-22552246   WT         mut        OK       0.843113   0         #NAME?              nan         0.0055     0.0576938    no
XLOC_007888   XLOC_007888   CG43391   chr3L:11573928-11574119   WT         mut        OK       2.8848     0         #NAME?              nan         0.01215    0.101753     no
XLOC_001158   XLOC_001158   Arr1      chr2L:18078268-18081187   WT         mut        HIDATA   0          7600.17   0                   0           1          1            no
XLOC_011439   XLOC_011439   Rh3       chr3R:15906781-15908172   WT         mut        HIDATA   2034.74    0         0                   0           1          1            no

Why am I seeing this discrepancy in the differential gene expression between ON and OFF genes. Especially for the genes that get a status called HIDATA.

How can I resolve this discrepancy and improve the calculating of differential gene expression for these ON and OFF genes.

Thank you

Rodrigo

Cufflinks RNA-Seq Cuffdiff • 2.7k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.8 years ago by rodrigo.nicodemos82 ▴ 10

0

Entering edit mode

Not really sure what you mean by ON/OFF genes, or, in your example, which gene(s) are "ON" and which gene(s) are "OFF"? Do you have p-values for the fold changes you found?

ADD REPLY • link 8.8 years ago by pld 5.1k

0

Entering edit mode

Hi Joe, If you scroll to the right in the table you will see that the p_value and q_value for the listed example genes.

ADD REPLY • link 8.8 years ago by rodrigo.nicodemos82 ▴ 10

0

Entering edit mode

Sorry, didn't see the scroll bar at first.

What does the raw file say? #NAME? is an excel problem. I'm assuming you have NaN values there, when there's no expression of a gene in one condition, the fold change is +/-infinity.

A few other people seem to have had this problem:

You could play around with the settings a bit more or check your reads to make sure you're not underestimating even a small amount of expression in the "OFF" cases. You could also just add a small psuedocount to all of the FPKM values. X/0 = infinity, so X/0.000...1 -> infinity, you'll just have massive fold changes instead of infinity.

A non-bioinformatics answer would be to double check these genes with qPCR.

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.8 years ago by pld 5.1k