Question

Large Variation In Rpkm Values Of Different Biological Replicates From Cufflinks

1

Entering edit mode

10.2 years ago

hoangtv ▴ 30

Hi all, I want to rank gene expression based on the transcript abundance to see the relative expressions of different genes in my RNA-seq sample. I ran Cufflinks for 3 biological replicates for that condition to get the FPKM values. However, the RPKM values are very different within the 3 biological replicates, which I am not sure why and how to deal with this. In Cufflinks manual, It said that Cufflinks can only use classic-fpkm normalization on one library at a time. Should I use Cuffdiff2 to normalize FPKMs and fragment counts across all libraries (This policy is identical to the one used in DESeq) and take the average of these normalized RPKMs?. Or do you think it is ok to get the FPKMs from biological replicates produced by Cufflinks and feed them into DESeq to normalize them?

Please share your opinions. Regards Thanh

cufflinks deseq replicates rpkm • 6.2k views

ADD COMMENT • link updated 4.7 years ago by Biostar 20 • written 10.2 years ago by hoangtv ▴ 30

0

Entering edit mode

Just out of curiosity, have you compared your replicates at the read level, thereby making sure that your biological replicates do in fact show very similar tag density profile? You could compare the aligned bam files.

ADD REPLY • link 10.2 years ago by vj ▴ 520

score 1 · Answer 1 · 2014-02-11

1

Entering edit mode

10.2 years ago

Charles Warden 8.2k

I use gene-level abundance (from genes.fpkm_tracking) for differential expression. At least in my experience, every time that I saw a potentially interesting different expression trends for different transcripts of the same gene, the alignment didn't really seem to support the predicted differences (I think the issue may be differences in low coverage, but I think uneven coverage is also a confounding issue). I have found differential splicing events (predicted from MATS, for example) to be decent, but that is not really related to gene expression differences / RPKM measurements.

In other words, do you get more consistent results from biological replicates from genes.fpkm_tracking rather than isoforms.fpkm_tracking (or gene_exp.diff versus isoform_exp.diff, from cuffdiff)? Of course, there may just be true biological variability, but this is one potential technical issue that I am aware of.

FYI, you need to provide DESeq with raw counts, not FPKM values.

Although most of my experience has been with cuffdiff (not cuffdiff2), I'm pretty sure the actual results will not be the same for cuffdiff (or cuffdiff2) and DESeq (or DESeq2). In fact, the cuffdiff2 paper compares the differences in results for cuffdiff2, DESeq, and edgeR:

http://www.nature.com/nbt/journal/v31/n1/full/nbt.2450.html

My expectation is that DESeq will probably give better results than cuffdiff, but I think it will be useful to compare your own results. Once you have a table of raw counts, DESeq will run much more quickly than cuffdiff. So,your own personal benchmark shouldn't take that long, and I think it can help you be a lot more confident in your results.

ADD COMMENT • link 10.2 years ago by Charles Warden 8.2k

0

Entering edit mode

Thanks cwarden45 very much for your info, For my data , I want to look at 2 things: transcript abundance and differential gene expression. Actually, I did compared Cuffdiff2, Deseq, CLCbio and EdgeR with qPCR validation for a handful of genes. I found that Cufdiff2 and DESeq results correlate best with qPCR results as compared with other tools. However, the total number of differential expressed genes found by DESeq and Cuffdiff2 are different. Because DESeq results correlate with qPCR better than Cufdiff2, so we decide to choose DESeq for differential gene expression. But to look at transcript abundance, I may need to get some kind of normalized RPKMs. Since we are going to publish our analysis, I am not sure I can use DESeq for differential expression and Cuffdiff2 to get normalized RKPM for transcript abundance. I am not sure whether the reviewers will be convinced with this approach. I am aslo thinking about converted normalized read counts produced by DESeq to RPKM, but not sure if anyone did this and whether this is a good way to go.

ADD REPLY • link 10.2 years ago by hoangtv ▴ 30

0

Entering edit mode

I would typically consider differential gene expression and differential transcript expression to be the same sort of thing, but working with different sets of annotations and methodologies.

Are you not trying to identify gene isoforms that vary between conditions? If not, can you help clarify what you want to do with the transcript abundance results?

As long as you are providing counts, it doesn't really matter to DESeq how the genomic regions to bin those counts were defined.

Also, I would typically recommend DESeq and limma among the popular tools for RNA-Seq differential expression, but I also developed the sRAP package (if that helps):

http://www.bioconductor.org/packages/release/bioc/html/sRAP.html

I also have these benchmarks (which include sRAP):

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

http://bioinfo.aizeonpublishers.net/content/2013/6/285-292.html

However, I am not sure how the transcript abundance is being used (if not for differential expression), so I'm not sure if this actually helps solve your problem.

ADD REPLY • link 10.2 years ago by Charles Warden 8.2k

0

Entering edit mode

Hi cwardin45, Thanks for your reply and sharing your tool. I mean that I only want to look at the gene expression level and differential gene expression between 2 conditions , not isoform expression level and isoform differential expression. Also, in the same condition, I want to to see the relative gene expression level between different genes, in some kinds of normalized RKPM. Since DESeq can provide normalized counts, which is not normalized to gene length, I am looking for some other tools to get genes RKPMs. I am thinking to use easyRNA-seq to get RPKMs value from normalized read counts produced by DESeq, but I am not confident this is a good way to go with.

ADD REPLY • link 10.2 years ago by hoangtv ▴ 30

0

Entering edit mode

For gene-level abundance, I think cufflinks should be OK. Transcript / isoform level abundance may vary more, but it sounds like this isn't essential for your purposes.

You can see Figure 5 in the paper (linked in the last response) shows that cufflinks and an RSEM-like algorithm in Partek show very consistent results at the gene level (and OK consistency at the transcript level). You may not care as much about the x-axis, but it represents RPKM rounding cutoffs (since the goal was to see how much gene lists varied with different rounding cutoffs).

If you don't like the cuffdiff results (and you don't like using DESeq because you are looking at the variance-alone per group based upon RPKM value (?)), I was recommending using limma or sRAP to analyze the RPKM values from cufflinks. This should be OK. This is basically what I do, and the benchmarks in that paper were used to justify my strategy.

Also, if you were really worried about it, you could use a gene / transcript reference (rather than a chromosome-based reference), align reads with Bowtie or BWA, and calculate RPKM values on your own using samtools idxstats. However, I think this is probably overkill - I really think gene-level cufflinks RPKM value should be OK.

ADD REPLY • link 10.2 years ago by Charles Warden 8.2k

0

Entering edit mode

Actually, I wouldn't recommend using idxstats to calculate RPKM values. I was curious to see how these results would look, so I compared them to eXpress (using the same transcript-based references) as well as cuffdiff and Partek (which uses and RSEM-like method for mRNA quantification):

http://cdwscience.blogspot.com/2014/02/mrna-quantification-via-express.html

Not sure if this was really a problem for you. However, I wanted to give a heads-up that eXpress would probably be a better option than my idxstats suggestion.

ADD REPLY • link 10.2 years ago by Charles Warden 8.2k