Question

Best Way To Normalize RNA-Seq Data For Expression Profiling

6

10.4 years ago

ajc8 ▴ 120

Hello,

I have transcript counts from RNA-seq data. There are three samples, and they are biological replicates of a cell line. My goal is to provide a ranked list of expressed genes, with some sort of expression quantifier for each gene/transcript.

I am wondering what the best way to normalize the data is. Should I just calculate RPKM values and then remove outliers per transcript? Or should I perform some sort of upper quantile normalization? If so, what is the best way to do this?

Thank you for your help!

rna-seq normalization • 39k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 10.4 years ago by ajc8 ▴ 120

2

I am probably stating the obvious, but RNA-seq is not a measure of absolute expression. It is closer to absolute expression than microarrays (probably), but comparisons between genes (ranking by expression) should be taken with multiple, large grains of salt.

ADD REPLY • link 10.4 years ago by Sean Davis 26k

1

RPKM between replicates should be fine, but if you want to optimize, check: Optimal Scaling of Digital Transcriptomes

ADD REPLY • link 10.4 years ago by JC 13k

Ram · Answer 1 · 2013-12-11

18

10.4 years ago

rtliu ★ 2.2k

Please check out Professor Lior Pachter's keynote at the Genome Informatics 2013 meeting at CSHL:

Quoted from slide page 45

The problem with FPKM:

Although abundances in FPKM are proportional to the relative abundances, the proportionality constant is experiment specific.

Li and Dewey go back to the basics in the RSEM paper (BMC Bioinformatics, 2011).
Instead of RPKM/FPKM, why not use a universal proportionality constant?
Instead of *, they propose TPM

Please use TPM in your papers!
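To make the slide's point concrete, here is a minimal Python sketch (not taken from the keynote) of how RPKM and TPM could be computed from raw counts. The counts and lengths are made-up toy values, the function names are hypothetical, and plain transcript length is used in place of the effective length that RSEM uses, which is a simplifying assumption.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (l * total) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: normalize by length first, then rescale."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    rate_sum = sum(rates)
    return [r / rate_sum * 1e6 for r in rates]


# Toy data (assumed values): three transcripts, two samples.
lengths = [1000, 2000, 1000]
counts_a = [100, 200, 300]
counts_b = [100, 200, 900]

for name, counts in (("sample A", counts_a), ("sample B", counts_b)):
    # The RPKM total differs between samples; the TPM total is
    # (up to floating-point error) 1e6 in every sample.
    print(name, "sum RPKM:", sum(rpkm(counts, lengths)),
          "sum TPM:", sum(tpm(counts, lengths)))
```

In this sketch the RPKM values of a sample add up to a different total in each sample, which is the experiment-specific constant the slide refers to, while TPM values always add up to one million.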

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 10.4 years ago by rtliu ★ 2.2k

1

Lior Pachter's slides and presentation are helpful. Here is another helpful video from one of his students, with an even more explicit discussion of FPKM and TPM:

ADD REPLY • link 8.2 years ago by Kamil ★ 2.3k

0

Although I agree with this in theory, I think assigning reads to the right transcripts is not trivial (whereas gene-level quantifications are very consistent among different tools).

As read length increases and coverage becomes more even, I might change my mind, but I think RPKM is currently the most practical strategy, and tools using RPKM measurements (such as limma) are known to provide accurate results. For example, you can check out these differential expression comparisons:

http://bib.oxfordjournals.org/content/early/2013/12/02/bib.bbt086.long

"In general, limma performed well under many circumstances in the present comparisons, being also computationally fastest to run."

http://genomebiology.com/2013/14/9/R95

"We find significant differences among the methods, but note that array-based methods adapted to RNA-seq data perform comparably to methods designed for RNA-seq."

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

*Recommend Partek or DESeq over edgeR or cuffdiff

ADD REPLY • link 10.4 years ago by Charles Warden 8.2k

score 2 · Answer 2 · 2015-06-16

2

8.9 years ago

Amitm ★ 2.2k

Sorry to bump an old thread, but I have been struggling with non-replicate RNA-seq data for a long time. In relation to the previous post, I don't think conversion to Z-scores is a valid idea.

A Z-score assumes an underlying normal distribution, which is not at all the case with RNA-seq data.

ADD COMMENT • link 8.9 years ago by Amitm ★ 2.2k

score 0 · Answer 3 · 2013-12-10

0

10.4 years ago

Charles Warden 8.2k

RPKM should be fine.

Sometimes I've seen samples with different median RPKM values, but trying to median center (or quantile normalize) the data ended up making the resulting gene lists worse (with regard to functional enrichment and positive controls).
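For readers unfamiliar with the term, median centering could look roughly like the Python sketch below. It is only an illustration of the operation, not a recommendation, given the experience reported above. The log2 transform, the pseudocount of 1, the choice of the mean of the sample medians as the common target, and the function name are all assumptions made for the example; the RPKM values are toy numbers.

```python
import math
from statistics import median


def median_center(samples, pseudocount=1.0):
    """Shift each sample's log2 values so that all samples share one median.

    samples: dict mapping sample name -> list of RPKM values (same gene order).
    """
    logged = {
        name: [math.log2(v + pseudocount) for v in values]
        for name, values in samples.items()
    }
    medians = {name: median(vals) for name, vals in logged.items()}
    # Common target: the mean of the per-sample medians (an arbitrary choice).
    target = sum(medians.values()) / len(medians)
    return {
        name: [v - medians[name] + target for v in vals]
        for name, vals in logged.items()
    }


# Toy RPKM values for two replicates with different medians.
samples = {"rep1": [1.0, 3.0, 7.0], "rep2": [3.0, 7.0, 15.0]}
centered = median_center(samples)
```

After centering, every sample has the same median on the log2 scale, so any global shift between replicates is removed, whether it was technical or biological.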

ADD COMMENT • link 10.4 years ago by Charles Warden 8.2k

score 0 · Answer 4 · 2013-12-11

0

10.4 years ago

vj ▴ 520

If your goal is just to rank the genes based on their expression, then you could try converting them to z-scores. You can use the "scale" function in R to do it.
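As a rough Python counterpart to that suggestion, the sketch below z-scores the expression values of one sample and ranks the genes. The gene names and values are made up, and the function name is hypothetical. It uses the sample standard deviation (n - 1 denominator), which is also what R's scale does by default.

```python
from statistics import mean, stdev


def z_scores(values):
    """Center on the mean and divide by the sample standard deviation.

    Needs at least two values that are not all identical.
    """
    mu = mean(values)
    sd = stdev(values)  # n - 1 denominator
    return [(v - mu) / sd for v in values]


# Toy expression values for a single sample (assumed numbers).
expr = {"geneA": 2.0, "geneB": 4.0, "geneC": 6.0}
z = dict(zip(expr, z_scores(list(expr.values()))))

# Highest expression first.
ranked = sorted(z, key=z.get, reverse=True)
```

Note that within a single sample the z-score is a monotonic transformation, so ranking by z-score gives the same order as ranking by the original values; it only changes the scale on which the values are reported.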

ADD COMMENT • link 10.4 years ago by vj ▴ 520

0

I'm not a statistician, but I remember that z-scores assume a normal distribution of the data. Right? I think RNA-seq data doesn't follow a normal distribution, and hence you will have a significant error in your estimation.

ADD REPLY • link 7.9 years ago by microbe77 ▴ 30