Question: Comparing RPKMs For One Test Sample Vs Multiple Controls
Travis (USA) wrote:

Hi all,

I have RPKM values for a single sample (lung adenocarcinoma) and wish to compare it to RPKM values for a group of controls (50 TCGA normal lung samples).

Bearing in mind the one-to-many nature of the analysis, and that RPKMs are the starting point, can someone recommend the best method or software for calculating differential expression with appropriate measures of significance? So far I have only calculated fold changes and Z-scores (mean- and median-based), but I am guessing this is overly simplistic.
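For reference, the basic fold-change and Z-score calculation described above might look like the following minimal sketch. It is not the exact calculation used here: the log2 transform, the pseudocount of 1, and the MAD-based scaling for the median variant are all assumptions, and the function name is hypothetical.

```python
import numpy as np

def one_vs_many_scores(sample_rpkm, control_rpkm, pseudocount=1.0):
    """Score one sample against a control cohort, gene by gene.

    sample_rpkm:  array of shape (n_genes,)
    control_rpkm: array of shape (n_genes, n_controls)
    The log2 transform and pseudocount are assumptions, not part of
    the original description.
    """
    sample_log = np.log2(np.asarray(sample_rpkm, dtype=float) + pseudocount)
    control_log = np.log2(np.asarray(control_rpkm, dtype=float) + pseudocount)

    # Mean-based statistics across the controls (sample standard deviation).
    mean = control_log.mean(axis=1)
    sd = control_log.std(axis=1, ddof=1)

    # Median-based statistics: the median absolute deviation, scaled by
    # 1.4826 so that it is comparable to a standard deviation under normality.
    median = np.median(control_log, axis=1)
    mad = 1.4826 * np.median(np.abs(control_log - median[:, None]), axis=1)

    # Genes with no spread in the controls give inf or nan rather than an error.
    with np.errstate(divide="ignore", invalid="ignore"):
        z_mean = (sample_log - mean) / sd
        z_median = (sample_log - median) / mad

    return {
        # Sample log2 value minus the mean of the control log2 values.
        "log2_fold_change": sample_log - mean,
        "z_mean": z_mean,
        "z_median": z_median,
    }
```

With 50 controls, the median-based variant is less sensitive to a few outlying normal samples, which is the usual reason for computing both.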

All help appreciated.

written 5.9 years ago by Travis • modified 5.9 years ago by Hayssam
Hayssam (France) wrote:

Hi, I don't think there's any reason not to start with one of the available differential expression tests in R. I'd recommend edgeR or DESeq. Both have nice tutorials to get you started, and both should handle the class imbalance adequately. However, these two methods expect raw read counts, not RPKMs. For the TCGA samples, raw counts are available, but I think you have to take level 2. Is there any reason for you to stick with RPKMs? If so, be aware that you risk losing statistical power by using them.

written 5.9 years ago by Hayssam

I had assumed it would not be safe to take raw counts from different sources/centers and attempt differential expression analysis. Do both DESeq and edgeR attempt to correct for issues like differences in sequencing depth?

reply written 5.9 years ago by Travis

Different library sizes (due to both different sequencing depths and different ratios of mappable reads) are exactly the raison d'être of these approaches. There are several papers explaining why RPKM does not deal with this appropriately. See e.g. "Differential Gene Expression Analysis - Rpkm Vs Readcount" and "Rnaseq Differential Expression". For a first look at RPKM inconsistencies, you can start with this blog post.
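To give an idea of how library-size differences can be corrected on raw counts, here is a minimal sketch of median-of-ratios size factors, the idea described for DESeq (edgeR uses TMM by default). This is an illustration only, not the package's actual code; the function name is hypothetical, and the layout of genes in rows and samples in columns is an assumption.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample from raw read counts.

    counts: array of shape (n_genes, n_samples).
    Illustrative sketch of the median-of-ratios idea; not DESeq's own code.
    """
    counts = np.asarray(counts, dtype=float)

    # Log counts; zeros become -inf and are filtered out below.
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)

    # Per-gene log geometric mean across samples. It is -inf whenever
    # any sample has a zero count for that gene.
    log_geo_mean = log_counts.mean(axis=1)
    usable = np.isfinite(log_geo_mean)
    if not usable.any():
        raise ValueError("no gene has non-zero counts in every sample")

    # Each sample's size factor is the median, over usable genes, of the
    # ratio between its count and the gene's geometric mean.
    log_ratios = log_counts[usable] - log_geo_mean[usable, None]
    return np.exp(np.median(log_ratios, axis=0))

# Toy example: the second library is sequenced twice as deeply.
counts = np.array([[10, 20], [100, 200], [0, 5], [30, 60]])
size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors  # depth-corrected counts
```

Dividing each sample's counts by its size factor puts the libraries on a common scale without the gene-length term used in RPKM.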

Furthermore, if you suspect batch effects (e.g. a lab effect for samples coming from different centers), linear modeling in edgeR can help you correct or account for them. A large-scale RNA-sequencing effort recently published a study that dealt adequately with batch effects. If that is of interest, you could start browsing from the GEUVADIS RNA-Seq website.

reply written 5.9 years ago by Hayssam • modified 5.9 years ago