RNA-Seq Normalization
12.8 years ago
Pasta ★ 1.3k

Hi,

I have sequenced bacterial mRNA samples from two different conditions using Illumina. The number of reads and the read quality are equivalent between the two samples, but I have no replicates.

I have already normalized the two samples by ORF length and total number of reads. Is that good enough, or should I add another normalization step?

Thanks ++

rna next-gen sequencing data • 6.9k views
12.8 years ago
Philippe ★ 1.9k

Hi,

The normalization of RNA-Seq (and, more generally, gene expression) data has to be chosen according to the kind of analysis you want to perform.

If you have normalized for ORF length and number of reads (I guess you used an RPKM measure), this is already a good first step.
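For reference, here is a minimal sketch of an RPKM calculation for a single sample. It assumes the library size is simply the sum of the counted reads, and the counts and ORF lengths are made-up toy values, not data from this thread.

```python
import numpy as np

def rpkm(counts, lengths_bp):
    """Reads per kilobase of ORF per million mapped reads, for one sample."""
    counts = np.asarray(counts, dtype=float)
    lengths_bp = np.asarray(lengths_bp, dtype=float)
    # Assumption: library size is the sum of the reads counted over all ORFs.
    total_reads = counts.sum()
    # counts / (length in kb) / (total reads in millions)
    return counts * 1e9 / (lengths_bp * total_reads)

# Hypothetical toy data: read counts and ORF lengths (bp) for 3 ORFs.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 500]
values = rpkm(counts, lengths_bp)
```

The first two ORFs get the same RPKM because the second has twice the reads but is also twice as long.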

If you want to do non-parametric statistical analyses (meaning based on the ranks of genes, not their absolute expression values), then you don't need any extra normalization step, since the ranks within each sample won't be affected.
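A quick way to convince yourself of the rank argument: rescaling a sample by any positive factor leaves the gene ranks unchanged. The sketch below uses made-up expression values and a hypothetical helper named `ranks`.

```python
import numpy as np

def ranks(values):
    """Rank genes within one sample (0 = lowest); ties are broken by position."""
    values = np.asarray(values)
    # argsort of argsort turns sort order into per-element ranks.
    return np.argsort(np.argsort(values, kind="stable"), kind="stable")

sample = np.array([5.0, 120.0, 0.5, 33.0])  # hypothetical expression values
scaled = sample * 2.7                        # any positive scaling factor

same_ranks = np.array_equal(ranks(sample), ranks(scaled))
```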

If you want to use the absolute levels, you can still do without normalization in some cases. However, since most normalization procedures won't affect the distribution within each sample (they only make the distributions of the two samples look more alike), it might be worth applying one.

Many methods have been described for cross-sample normalization. A basic and "neutral" one is to scale the expression values of your two samples so that they have a similar distribution, for example by adjusting them to the same median. Another option is to scale on the expression level of a set of highly expressed housekeeping genes, which you can define either from your own data or from a previously described set in the literature.
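As an illustration of the median adjustment, here is a minimal sketch that rescales the second sample to the median of the first. Taking the median over expressed (nonzero) genes only is an assumption of this sketch, and the function name and values are hypothetical.

```python
import numpy as np

def median_scale(sample_a, sample_b):
    """Rescale sample_b so its median over expressed genes matches sample_a's."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # Assumption: medians are taken over genes with nonzero expression only.
    factor = np.median(a[a > 0]) / np.median(b[b > 0])
    return b * factor, factor

# Hypothetical expression values (e.g. RPKM) for 4 genes in two conditions.
a = np.array([10.0, 20.0, 30.0, 0.0])
b = np.array([30.0, 60.0, 90.0, 0.0])
b_scaled, factor = median_scale(a, b)
```

Since a single factor is applied to the whole sample, the ordering of genes within it does not change.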

I hope this has been helpful.


Median scaling is a very basic (the most basic?) normalization method, and it might indeed not be relevant if your distribution is not normal. Scaling based on housekeeping genes might be a good alternative. With only two samples, though, it is probably more accurate to use a previously described set of housekeeping genes rather than one derived from your data, since you will have very little statistical power to detect genes that vary little between samples.
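Scaling on housekeeping genes could look something like the sketch below. It assumes you already have the row indices of a housekeeping set with nonzero expression in both samples, and it uses the geometric mean of the per-gene ratios as the scaling factor, which is one reasonable choice among several. All names and values are hypothetical.

```python
import numpy as np

def housekeeping_scale(sample_a, sample_b, hk_idx):
    """Rescale sample_b so the housekeeping genes match sample_a on average."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # Assumption: housekeeping genes are expressed (nonzero) in both samples.
    log_ratio = np.log(a[hk_idx]) - np.log(b[hk_idx])
    # Geometric mean of the a/b ratios over the housekeeping set.
    factor = np.exp(np.mean(log_ratio))
    return b * factor, factor

# Hypothetical expression values; genes 0 and 1 are the housekeeping set.
a = np.array([100.0, 200.0, 50.0, 10.0])
b = np.array([200.0, 400.0, 20.0, 40.0])
b_scaled, factor = housekeeping_scale(a, b, hk_idx=[0, 1])
```

After scaling, the housekeeping genes agree between the samples, and the remaining differences are the ones you would interpret.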


Thanks Philippe. I thought about median normalization, but is it statistically relevant to use this approach given that we do not obtain a normal distribution? Anyway, normalizing on housekeeping genes is a good idea; I will investigate.

12.8 years ago

I like the normalization as done in DESeq (http://www-huber.embl.de/users/anders/DESeq/), e.g. translated into Python:

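import numpy as np

def size_factors(counts):
    """DESeq-style size factors; counts is a genes x samples array."""
    counts = np.asarray(counts, dtype=float)
    # Keep only genes with a nonzero count in every sample (log(0) is undefined).
    counts = counts[np.all(counts > 0, axis=1)]
    logcounts = np.log(counts)
    # Per-gene log geometric mean across samples.
    loggeommeans = np.mean(logcounts, axis=1, keepdims=True)
    # Per-sample median ratio to the geometric mean, back on the count scale.
    return np.exp(np.median(logcounts - loggeommeans, axis=0))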
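
To show how the size factors would be used, here is a small self-contained sketch on made-up counts where the second sample was sequenced exactly twice as deeply as the first. It recomputes the same quantity inline rather than calling `size_factors()`, then divides each column by its factor.

```python
import numpy as np

# Hypothetical counts: rows are genes, columns are the two samples.
counts = np.array([[10, 20], [50, 100], [200, 400], [0, 7]], dtype=float)

# Drop genes with a zero count in any sample before taking logs.
expressed = counts[np.all(counts > 0, axis=1)]
log_ratios = np.log(expressed) - np.log(expressed).mean(axis=1, keepdims=True)
# Median ratio per sample: the same quantity size_factors() computes.
sf = np.exp(np.median(log_ratios, axis=0))

# Divide each column by its size factor to put the samples on a common scale.
normalized = counts / sf
```

Here the two factors differ by a factor of 2, so the first three genes end up with equal normalized counts in both samples.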

I am not sure that DESeq works well when there are no replicates.
