Question

varianceStabilizingTransformation() of FPKM data, count data unavailable

0

8.2 years ago

KRR • 0

Hi BioStars,

I'm new to the bioinformatics community and have a question about how to do differential gene expression analysis using the vst-limma pipeline for RNA-seq data. Unfortunately, I don't have FASTQ or BAM files to generate raw counts, but I do have Cufflinks-generated FPKM values. The data looks like the following:

gene       refseq           S00022       S00035       S00050       S00213       S00356
A1BG       NM_130786        14.0824      5.46565      3.70024      5.69252      4.90083
A1CF       NM_014576        0.010387     0.005099     0.002786     0.00199      0
A1CF       NM_138932        0.000402     0.000422     0.000231     0.000331     0
A1CF       NM_138933        0            0            0            2.00E-06     0
A1CF       NM_001198818     0            0            0            2.00E-06     0
A1CF       NM_001198820     0            0            0            0            0
A1CF       NM_001198819     0            0            0            0            0
A2LD1      NM_001195087     0.863905     1.15179      1.3101       0.993293     1.37598
A2LD1      NM_033110        0.447098     0.246576     12.4908      0.201043     0.088599
A2M        NM_000014        28.1252      39.6673      45.7157      86.3615      125.923
A2ML1      NM_144670        0            0            0            0            0
A4GALT     NM_017436        6.27533      9.83301      4.0222       5.04065      2.20022

After summing FPKM values across transcripts to obtain gene-level values, then taking a subset of the most variable genes, I want to use the DESeq package, specifically the estimateDispersions() and varianceStabilizingTransformation() functions, to generate transformed FPKM values for use in the limma package. The end goal is to perform differential gene expression analysis.
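For readers following along, the aggregation and filtering steps described above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the toy table reuses a few rows from the data above, median absolute deviation is assumed as the variability measure (suggested by the name my_mat_mad, but not confirmed in the post), and n_top is a hypothetical cutoff.

```r
# Toy transcript-level FPKM table (values copied from three genes above).
fpkm <- data.frame(
  gene   = c("A1BG", "A2LD1", "A2LD1", "A2M"),
  refseq = c("NM_130786", "NM_001195087", "NM_033110", "NM_000014"),
  S00022 = c(14.0824, 0.863905, 0.447098, 28.1252),
  S00035 = c(5.46565, 1.15179, 0.246576, 39.6673),
  S00050 = c(3.70024, 1.3101, 12.4908, 45.7157),
  stringsAsFactors = FALSE
)

sample_cols <- setdiff(names(fpkm), c("gene", "refseq"))

# Sum transcript-level FPKM within each gene; rowsum() returns a matrix
# with one row per gene, sorted by gene name.
gene_fpkm <- rowsum(as.matrix(fpkm[, sample_cols]), group = fpkm$gene)

# Rank genes by median absolute deviation (assumed variability measure)
# across samples and keep the n_top most variable ones.
n_top <- 2  # hypothetical cutoff
gene_mad <- apply(gene_fpkm, 1, mad)
top_genes <- names(sort(gene_mad, decreasing = TRUE))[seq_len(n_top)]
my_mat_mad <- gene_fpkm[top_genes, , drop = FALSE]
```

Note that the result is still a plain numeric matrix of FPKM values, not a count object, which is the situation the error below arises from.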

I think my problem is that I can't use these functions directly on my imported data matrix because they require an object of the CountDataSet class. At least, I get the following error:

mat_disp <- estimateDispersions(my_mat_mad,method = "blind",fitType="local")

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'estimateDispersions' for signature '"matrix"'

Any advice on how to go about this analysis? If anyone can also comment on the validity of this approach, I would appreciate any and all insights.

Kind regards,
KR

RNA-Seq R • 2.8k views

Updated 21 months ago by Ram 43k • written 8.2 years ago by KRR • 0

Ram · Accepted Answer · 2016-02-10

4

8.2 years ago

ablanchetcohen ★ 1.2k

The approach is not valid. You need the raw counts, period. Cufflinks applies several transformations to the raw data (FPKM is, by definition, normalized for transcript length and sequencing depth), and these will bias any downstream analysis performed on the FPKM values it outputs.
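To illustrate one of those transformations, here is a minimal base R sketch of the textbook FPKM definition. It is not Cufflinks' actual estimation procedure, which involves more than this, and the gene lengths, read counts, and library size are made-up numbers chosen for the example.

```r
# Textbook definition (an assumption for illustration, not Cufflinks' code):
# FPKM = fragments mapped to the feature * 1e9 /
#        (feature length in bp * total mapped fragments)
fpkm_from_counts <- function(counts, length_bp, total_mapped) {
  counts * 1e9 / (length_bp * total_mapped)
}

# Two hypothetical genes in the same library of 20 million mapped fragments.
total_mapped <- 2e7
short_gene <- fpkm_from_counts(counts = 10,  length_bp = 500,  total_mapped)
long_gene  <- fpkm_from_counts(counts = 100, length_bp = 5000, total_mapped)

# Under simple Poisson sampling, the relative noise depends on the count,
# not on the FPKM value.
poisson_cv <- function(counts) 1 / sqrt(counts)
cv_short <- poisson_cv(10)
cv_long  <- poisson_cv(100)
```

Both hypothetical genes end up with the same FPKM even though one is backed by ten times as many fragments, so the link between expression level and sampling variance that count-based methods such as DESeq rely on cannot be recovered from the FPKM value alone.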

Updated 21 months ago by Ram 43k • written 8.2 years ago by ablanchetcohen ★ 1.2k

1

I agree totally with ablanchetcohen: get the raw counts if you want to make use of DESeq or limma. Don't waste your time with FPKM data or with trying to transform it back; that is not possible. Just ask your sequencing facility to give you the FASTQ or BAM files. If they don't keep them for you, find another sequencing facility for future experiments.

Reply written 8.2 years ago by Benn 8.3k