Question: Statistical tests for RPKM comparison
rpjl1230 (United Kingdom) wrote, 5.1 years ago:

Hi, I was wondering if anyone can suggest which statistical tests I can use for differential gene expression with RPKM values.

I am stuck with RPKM because DESeq and edgeR are unfortunately not an option for me. I have generated RNA-Seq data from a dozen tissue samples collected from patients with 2 disease phenotypes (n=6 for each group), and I would like to compare my results to published RNA-Seq data from in vitro systems. Unfortunately, the authors of the published paper only provided a table of RPKM values (and only a single replicate for each of the tested conditions!). They also did not deposit any fastq files in any database, so I cannot even redo the alignment.

In this scenario, where I have no raw counts for the published data and uneven group sizes, is there any way that I can still reliably compare my data to the existing data and do differential expression analysis? Any suggestions would be very helpful!

Many thanks!


RPKMs are terrible for statistics. Would it be possible to just analyse your samples and compare the resulting fold-changes/DE genes to those from the published study? That might yield nice results.

written 5.1 years ago by Devon Ryan

I thought about doing FC as well but have another dilemma. What would you use as a cut-off? Presumably I will have to use some arbitrary cut-off, e.g. if log2FC > 2 then highly abundant. Is there any way to make it more objective?

Also, to do FC, would you:

1) Just do a standard FC calculation of my data relative to the published data? Or

2) Calculate FC from the median RPKM for each sample and then compare the perturbation of my patient data to the in vitro data? I thought about this because the patient samples have pretty high intragroup variance (some samples have much higher/lower median RPKM than others within the same group) and I am not sure how to "normalize" properly.
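
One possible reading of option 2 is to express each gene as a log2 ratio to its own sample's median RPKM, so that samples with very different overall levels become more comparable. The sketch below assumes that reading; the function name, the example values and the pseudocount of 1 are illustrative choices, not something taken from the thread or from a published method.

```python
import math
from statistics import median

def log2_ratio_to_median(rpkm, pseudocount=1.0):
    """Log2 ratio of each gene to the sample's median RPKM.

    rpkm: dict mapping gene name -> RPKM for ONE sample.
    pseudocount: assumed offset to avoid log2(0); its value is arbitrary.
    """
    med = median(rpkm.values())
    return {
        gene: math.log2((value + pseudocount) / (med + pseudocount))
        for gene, value in rpkm.items()
    }

# Hypothetical RPKM values for a single sample.
sample = {"geneA": 1.0, "geneB": 3.0, "geneC": 15.0}
scaled = log2_ratio_to_median(sample)
```

The same transformation would have to be applied to every sample, including the published ones, before any comparison. It does not remove the other sources of variation between studies.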

Thanks again!

modified 5.1 years ago • written 5.1 years ago by rpjl1230

I am facing a similar situation with data on GEO only being present in log2 FPKM. After you converted to TPM, how did you end up testing for differential expression?


written 4.2 years ago by acorella30
John wrote, 5.1 years ago:

This is a very good article about the pros and cons of doing stats on FPKM, and how to calculate TPM, which is by all means a better statistic when you don't have much else to work with.
There are plenty of others, but that link also points to all the good ones :)
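
For reference, going from RPKM/FPKM to TPM only needs the values themselves: within one sample, each value is divided by the sample's total and multiplied by one million. A minimal sketch, assuming the table holds untransformed RPKM values for every gene in the sample (gene names and numbers below are made up):

```python
def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM/FPKM values so that they sum to one million.

    rpkm: dict mapping gene name -> RPKM for ONE sample (not log-transformed).
    """
    total = sum(rpkm.values())
    if total <= 0:
        raise ValueError("sample has no expression signal")
    return {gene: value / total * 1e6 for gene, value in rpkm.items()}

# Hypothetical RPKM values for a single sample.
sample = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
tpm = rpkm_to_tpm(sample)
```

If the published table has been filtered to a subset of genes, the total is no longer the true per-sample total, and the resulting TPM values are only comparable within that subset.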

But to answer the higher-level question of not "how can you" but "why should you" do differential analysis, consider the following sources of noise:
- your technical "hands" versus theirs (experimental noise)
- in vivo vs in vitro (biological noise - this will most likely be huge)
- differences in sequencing technology (paired vs single)
- differences in data quality and total data (read depth, better mapping, etc.)
- all the statistical noise inherent in going into an analysis 'blind' without looking at a specific region/difference to begin with.

We've all got to prioritise time, and personally, if it were me, I would either do a very rudimentary analysis to get a ballpark figure for overall correlation, or not even bother downloading the external data in the first place and focus my efforts on doing the very best/cleanest analysis with what sounds like some pretty good data (n=6) by RNA-Seq standards :)
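
A ballpark correlation of that kind could be a Spearman rank correlation over the genes shared by both datasets, since ranks are less sensitive to scale differences between studies than raw values. The sketch below is one way to do it in plain Python; the function names and the example values (which could be TPM or log2 fold-changes) are hypothetical.

```python
import math

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(mine, published):
    """Spearman correlation over genes present in both dicts."""
    genes = sorted(set(mine) & set(published))
    if len(genes) < 3:
        raise ValueError("too few shared genes")
    x = [mine[g] for g in genes]
    y = [published[g] for g in genes]
    return pearson(ranks(x), ranks(y))

# Hypothetical per-gene values from the two studies.
mine = {"geneA": 1.5, "geneB": -0.2, "geneC": 3.1, "geneD": 0.4}
published = {"geneA": 2.0, "geneB": -1.0, "geneC": 2.5, "geneE": 0.7}
rho = spearman(mine, published)
```

With a single replicate per published condition, a number like this is descriptive only and says nothing about statistical significance.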


Thanks John! I have already compared my two in vivo phenotypes and the results are indeed quite nice, but we wanted to compare them to in vitro conditions to see what stress factors might be associated with each of the two in vivo conditions, hence the published data come in handy.


You are absolutely correct about the differences in technology, data quality and all the statistical noise - it's almost like comparing apples to pears, but getting some ballpark figure will be nice enough for us.

One question regarding TPM. I have been doing some reading (including the Wagner paper) but I don't quite understand what Z (mean read length) stands for. Is it referring to the average bp of the reads from the sequencer? Or is it referring to the average bp mapped to each gene in a given sample? I would greatly appreciate it if someone could clarify this for me. Thanks again!
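
Whichever definition applies, a read-length term that is the same constant for every gene in a sample cancels out once the per-gene rates are rescaled to sum to one million. The sketch below illustrates this numerically, assuming a count-based form in which each gene's rate is reads × read length / gene length; the function name, counts and lengths are invented for illustration and are not taken from the paper.

```python
def tpm_from_counts(counts, gene_lengths, read_length):
    """TPM from read counts, assuming rate = reads * read_length / gene_length."""
    rates = {g: counts[g] * read_length / gene_lengths[g] for g in counts}
    total = sum(rates.values())
    return {g: rate / total * 1e6 for g, rate in rates.items()}

# Hypothetical counts and gene lengths (bp) for one sample.
counts = {"geneA": 100, "geneB": 300}
lengths = {"geneA": 1000, "geneB": 2000}

# Same sample, two different assumed mean read lengths.
tpm_50 = tpm_from_counts(counts, lengths, read_length=50)
tpm_100 = tpm_from_counts(counts, lengths, read_length=100)
```

Both calls give the same TPM values, so under this assumption the exact value of the read-length term does not change the result as long as it is constant within a sample.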



written 5.1 years ago by rpjl1230