Question

RNA-Seq Normalization

3

12.8 years ago

Pasta ★ 1.3k

Hi,

I have sequenced bacterial mRNA samples from 2 different conditions using Illumina. The number of reads and the quality are equivalent between the 2 samples, but I have no replicates.

I have already normalized the 2 samples by ORF length and total number of reads. Is that good enough, or should I add another normalization step?

Thanks ++

rna next-gen sequencing data • 6.9k views

ADD COMMENT • link updated 12.8 years ago by Marcin Cieslik ▴ 520 • written 12.8 years ago by Pasta ★ 1.3k

score 2 · Answer 1 · 2011-06-21

2

12.8 years ago

Philippe ★ 1.9k

Hi,

The normalization of RNA-Seq (and, more generally, gene expression) data should be chosen according to the kind of analysis you want to perform.

If you have normalized for gene (ORF) length and number of reads (I guess you used an RPKM measure), this is already a good first step.
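For reference, an RPKM-style value can be computed roughly as below. This is only a minimal sketch, assuming raw read counts per gene and gene lengths in base pairs for a single sample; the function name `rpkm` and the toy numbers are made up for illustration.

```python
import numpy as np

def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million mapped reads, for one sample."""
    counts = np.asarray(counts, dtype=float)
    lengths_bp = np.asarray(lengths_bp, dtype=float)
    per_million = counts.sum() / 1e6   # library size in millions of reads
    per_kb = lengths_bp / 1e3          # gene length in kilobases
    return counts / per_million / per_kb

# Toy data (hypothetical): three genes in one sample
counts = [500, 1500, 8000]
lengths = [1000, 3000, 2000]
values = rpkm(counts, lengths)
```

Note that the first two genes end up with the same value even though their raw counts differ by a factor of three, because the longer gene simply collects more reads.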

If you want to do non-parametric statistical analyses (meaning based on the ranks of genes within each sample, not their absolute expression values), then you don't need any extra normalization step, since a normalization that rescales each sample leaves the ranks unchanged.
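A quick way to convince yourself of the rank argument is the small check below. It is a sketch with made-up expression values and an arbitrary scaling factor; the helper `ranks` is hypothetical and not taken from any package.

```python
import numpy as np

def ranks(values):
    """Rank of each gene within a sample (0 = lowest); ties broken by position."""
    return np.argsort(np.argsort(values, kind="stable"), kind="stable")

expr = np.array([12.0, 0.5, 300.0, 45.0])   # hypothetical expression values
scaled = expr * 1.7                          # any positive per-sample scaling factor

# True when the rescaled sample orders the genes exactly as the original does
same_ranks = np.array_equal(ranks(expr), ranks(scaled))
```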

If you want to use the absolute levels, you can still do without normalization in some cases. However, since most normalization procedures leave the relative distribution within each sample unchanged (they only make the distributions of the two samples look more alike), it might be worth applying one.

Many methods have been described for cross-sample normalization. A basic and "neutral" one is to scale the expression values of your two samples so that they have similar distributions. For example, you can adjust the two distributions to the same median value, or scale on the expression level of a set of highly expressed housekeeping genes (which you can define from your own data or take from a set previously described in the literature).
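The two scaling options just mentioned could look something like this. It is a minimal sketch, assuming length-normalized values for the same genes in both conditions; the function names, the choice to take the median over expressed (non-zero) genes only, and the housekeeping indices are all assumptions for illustration.

```python
import numpy as np

def median_scale(a, b):
    """Rescale sample b so its median over expressed genes matches sample a."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    factor = np.median(a[a > 0]) / np.median(b[b > 0])
    return b * factor, factor

def housekeeping_scale(a, b, hk_idx):
    """Rescale sample b so the mean of the chosen housekeeping genes matches sample a."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    factor = a[hk_idx].mean() / b[hk_idx].mean()
    return b * factor, factor

# Hypothetical RPKM-like values for five genes in two conditions
cond1 = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
cond2 = np.array([5.0, 10.0, 20.0, 40.0, 400.0])

cond2_med, f_med = median_scale(cond1, cond2)
# Indices 0 and 1 stand in for a housekeeping set chosen elsewhere
cond2_hk, f_hk = housekeeping_scale(cond1, cond2, hk_idx=[0, 1])
```

In this toy case both approaches give the same factor; on real data they can disagree, which is itself a useful sanity check.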

I hope this has been helpful.

ADD COMMENT • link 12.8 years ago by Philippe ★ 1.9k

1

Median scaling is a very basic (perhaps the most basic) normalization method, and it might indeed not be appropriate if your distribution is not normal. Scaling based on housekeeping genes might be a good alternative. With only two samples, though, it is probably more accurate to use a set of previously described housekeeping genes, since you will have very little statistical power to detect genes that vary little between samples.

ADD REPLY • link 12.8 years ago by Philippe ★ 1.9k

0

Thanks Philippe. I thought about median normalization, but is this approach statistically sound given that our data are not normally distributed? In any case, normalizing on housekeeping genes is a good idea; I will investigate.

ADD REPLY • link 12.8 years ago by Pasta ★ 1.3k

Ram · Answer 2 · 2011-06-23

0

12.8 years ago

Marcin Cieslik ▴ 520

I like the normalization as done in DESeq (http://www-huber.embl.de/users/anders/DESeq/), e.g. translated into Python:

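import numpy as np

def size_factors(counts):
    # counts: genes x samples array of raw read counts
    counts = np.asarray(counts, dtype=float)
    # keep only genes with a non-zero count in every sample (log(0) is undefined)
    counts = counts[np.all(counts > 0, axis=1)]
    logcounts = np.log(counts)
    # per-gene log geometric mean across samples
    loggeommeans = np.mean(logcounts, axis=1, keepdims=True)
    # per-sample median of the log ratios to the geometric mean
    return np.exp(np.median(logcounts - loggeommeans, axis=0))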
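To show how the size factors would be applied, here is a small self-contained example on made-up counts. The median-of-ratios function is repeated so the block runs on its own, and the count table is hypothetical: condition 2 is sequenced twice as deep and one gene is strongly induced.

```python
import numpy as np

def size_factors(counts):
    # Same median-of-ratios idea as the function above, repeated so this runs alone
    counts = np.asarray(counts, dtype=float)
    counts = counts[np.all(counts > 0, axis=1)]
    logcounts = np.log(counts)
    loggeommeans = logcounts.mean(axis=1, keepdims=True)
    return np.exp(np.median(logcounts - loggeommeans, axis=0))

# Hypothetical counts: rows are genes, columns are the two conditions
counts = np.array([
    [100, 200],
    [50, 100],
    [30, 60],
    [20, 4000],   # strongly induced gene
    [80, 160],
])

sf = size_factors(counts)
normalized = counts / sf            # divide each column by its size factor
depth_ratio = sf[1] / sf[0]         # estimated depth of condition 2 relative to 1
total_ratio = counts[:, 1].sum() / counts[:, 0].sum()
```

Here `depth_ratio` recovers the twofold depth difference, while `total_ratio` is dominated by the single induced gene. That robustness to a few strongly changing genes is the main attraction over scaling by total read count.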

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.8 years ago by Marcin Cieslik ▴ 520

0

I am not sure that DESeq works well when there are no replicates.

ADD REPLY • link 12.8 years ago by Pasta ★ 1.3k