Best Way To Normalize RNA-Seq Data For Expression Profiling
4
6
10.4 years ago
ajc8 ▴ 120

Hello,

I have transcript counts from RNA-seq data. There are three samples, and they are biological replicates of a cell line. My goal is to provide a ranked list of expressed genes, with some sort of expression quantifier for each gene/transcript.

I am wondering about the best way to normalize the data: should I just calculate RPKM values and then remove outliers per transcript? Or should I perform some sort of upper-quartile normalization? If so, what is the best way to do this?

Thank you for your help!

rna-seq normalization • 39k views
ADD COMMENT
2

I am probably stating the obvious, but RNA-seq is not a measure of absolute expression. It is closer to absolute expression than microarrays (probably), but comparisons between genes (ranking by expression) should be taken with multiple, large grains of salt.

ADD REPLY
1

RPKM between replicates should be fine, but if you want to optimize, check: Optimal Scaling of Digital Transcriptomes

ADD REPLY
18
10.4 years ago
rtliu ★ 2.2k

Please check out Professor Lior Pachter's keynote at the Genome Informatics 2013 meeting at CSHL:

Quoted from slide 45:

The problem with FPKM:

Although abundances in FPKM are proportional to the relative abundances, the proportionality constant is experiment specific.

Li and Dewey go back to the basics in the RSEM paper (BMC Bioinformatics, 2011).
Instead of RPKM/FPKM, why not use a universal proportionality constant?
Instead of *, they propose TPM

Please use TPM in your papers!
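
To make the point about the proportionality constant concrete, here is a minimal sketch (not taken from the slides) that computes both FPKM and TPM from raw counts. The counts and lengths are made-up toy values, and plain transcript length is used where RSEM would use an effective length, so treat it as an illustration only.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]


# Hypothetical toy data: three transcripts measured in two samples.
lengths = [1000, 2000, 500]   # transcript lengths in bp (assumed)
sample_a = [100, 400, 50]     # raw counts (assumed)
sample_b = [100, 400, 1000]

for name, counts in (("A", sample_a), ("B", sample_b)):
    print(name,
          "FPKM total:", round(sum(fpkm(counts, lengths))),
          "TPM total:", round(sum(tpm(counts, lengths))))
```

The TPM values sum to one million in every sample, while the FPKM total changes from sample to sample. TPM is just FPKM divided by that sample-specific total and multiplied by 1e6, so within one sample the ranking of transcripts is the same under both measures.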

ADD COMMENT
1

Lior Pachter's slides and presentation are helpful. Here is another helpful video from one of his students, with even more explicit discussion about FPKM and TPM:

ADD REPLY
0

Although I agree with this in theory, I think assigning the reads to the right transcripts is not trivial (whereas gene-level quantifications are very consistent among different tools).

As read length increases and coverage becomes more even, I might change my mind, but I think RPKM is currently the most practical strategy and tools using RPKM measurements (such as limma) are known to provide accurate results. For example, you can check out these differential expression comparisons:

http://bib.oxfordjournals.org/content/early/2013/12/02/bib.bbt086.long

"In general, limma performed well under many circumstances in the present comparisons, being also computationally fastest to run."

http://genomebiology.com/2013/14/9/R95

"We find significant differences among the methods, but note that array-based methods adapted to RNA-seq data perform comparably to methods designed for RNA-seq."

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

*Recommend Partek or DESeq over edgeR or cuffdiff

ADD REPLY
2
8.9 years ago
Amitm ★ 2.2k

Sorry to bump an old thread, but I have been struggling with non-replicate RNA-seq data for a long time. So, in relation to the previous post, I don't think conversion to Z-scores is a valid idea.

A Z-score assumes an underlying normal distribution, which is not at all the case with RNA-seq data.

ADD COMMENT
0
10.4 years ago

RPKM should be fine.

Sometimes I've seen samples with different median RPKM values, but trying to median center (or quantile normalize) the data ended up making the resulting gene lists worse (with regard to functional enrichment and positive controls).
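
For anyone who wants to try the same comparison, here is a rough sketch of the two adjustments mentioned above. It assumes each sample is a list of log-scale expression values in the same gene order, and it breaks ties by position instead of averaging them, which is a simplification. The function names are my own and do not come from any package.

```python
from statistics import median


def median_center(samples):
    """Subtract each sample's own median from its (log-scale) values."""
    centered = []
    for values in samples:
        m = median(values)
        centered.append([v - m for v in values])
    return centered


def quantile_normalize(samples):
    """Force every sample onto the same distribution of values.

    Each value is replaced by the mean, across samples, of the values
    that share its rank. Ties are broken by position (simplification).
    """
    n = len(samples[0])
    sorted_samples = [sorted(values) for values in samples]
    rank_means = [sum(s[i] for s in sorted_samples) / len(samples)
                  for i in range(n)]
    normalized = []
    for values in samples:
        order = sorted(range(n), key=lambda i: values[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical log2(RPKM + 1) values for three genes in two samples.
toy = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(toy))
print(median_center(toy))
```

Both adjustments keep the order of genes within each sample, so they only matter when values are compared or combined across samples.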

ADD COMMENT
0
10.4 years ago
vj ▴ 520

If your goal is just to rank the genes based on their expression, then you could try converting them to z-scores. You can use the "scale" function in R to do it.
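
As a rough illustration of what that does, here is a small Python sketch of the same centering and scaling for one sample. The RPKM values are invented, and the log2(x + 1) step is an extra assumption on my part and is not something "scale" does for you.

```python
import math
from statistics import mean, stdev


def zscores(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample sd (n - 1 denominator), as R's sd() uses
    return [(v - m) / s for v in values]


def rank_order(values):
    """Indices of the values, from highest to lowest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Hypothetical RPKM values for five genes in one sample.
rpkm = [0.5, 12.0, 250.0, 3.1, 48.0]
log_rpkm = [math.log2(v + 1) for v in rpkm]  # assumed log transform
z = zscores(log_rpkm)

print(rank_order(rpkm))
print(rank_order(z))
```

Note that the log transform and the z-score are both order-preserving, so the ranking of genes within a sample is the same before and after. The z-score only changes the units of the expression quantifier.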

ADD COMMENT
0

I'm not a statistician, but I remember that z-scores assume a normal distribution of the data. Right? I think RNA-seq data doesn't follow a normal distribution, and hence you will have a significant error in your estimation.

ADD REPLY
