Question

Normalize gene expression data between 0 and 1

3.9 years ago

m.taheri ▴ 50

I have a set of RNA-seq gene expression data (normalized by the RPM method) from different samples. I need to rescale the gene expression values to lie between 0 and 1.

Let:

MIN is the minimum gene expression across all genes and all samples

MAX is the maximum gene expression across all genes and all samples

MIN_X is the minimum expression of gene X across all samples

MAX_X is the maximum expression of gene X across all samples

Normalized expression values for gene X can be calculated in one of the following ways:

1: Scaling with the global minimum and maximum. This results in very small gene expression values:

normalized gene expression X = (gene expression X - MIN) / (MAX - MIN)

2: Scaling with the gene-specific minimum and maximum. This results in loss of information about the ratio of gene expression values:

normalized gene expression X = (gene expression X - MIN_X) / (MAX_X - MIN_X)

Is there a way to normalize gene expression values to the range [0, 1] while avoiding the above issues?
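To make the two issues concrete, here is a minimal Python sketch of both formulas on a made-up two-gene matrix. The gene names, the values, and the helper names `minmax_global` and `minmax_per_gene` are hypothetical and only serve the illustration.

```python
# Toy matrix: rows are genes, columns are samples (values are made up).
expr = {
    "geneA": [1000.0, 2000.0, 4000.0],  # highly expressed
    "geneB": [1.0, 2.0, 4.0],           # lowly expressed
}

def minmax_global(matrix):
    """Option 1: scale with the global MIN and MAX over all genes and samples."""
    values = [v for row in matrix.values() for v in row]
    lo, hi = min(values), max(values)
    return {g: [(v - lo) / (hi - lo) for v in row] for g, row in matrix.items()}

def minmax_per_gene(matrix):
    """Option 2: scale each gene with its own MIN_X and MAX_X."""
    scaled = {}
    for g, row in matrix.items():
        lo, hi = min(row), max(row)
        # A gene that is constant across samples has MAX_X == MIN_X;
        # mapping it to 0 is a choice made for this sketch.
        scaled[g] = [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in row]
    return scaled

global_scaled = minmax_global(expr)
gene_scaled = minmax_per_gene(expr)

print(global_scaled["geneB"])                      # every value is close to 0
print(gene_scaled["geneA"], gene_scaled["geneB"])  # the two profiles coincide
```

With option 1 the lowly expressed gene is squeezed near 0, and with option 2 the 1000-fold difference between the two genes disappears.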

Why don't you normalize with a Z-score? That will indicate, for each gene, how much each sample deviates from the mean of all samples (per gene).

Transform your counts to log2 and then do t(scale(t(log2matrix))).
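For readers who do not work in R, the suggestion above can be sketched in plain Python roughly as follows. The toy values, the helper name `zscore_per_gene`, and the pseudocount of 1 (added so that log2 of zero never occurs) are assumptions of this sketch and are not part of the thread.

```python
import math
import statistics

# Toy RPM matrix: rows are genes, columns are samples (values are made up).
rpm = {
    "geneA": [1000.0, 2000.0, 4000.0],
    "geneB": [1.0, 2.0, 4.0],
}
PSEUDOCOUNT = 1.0  # assumption: avoids log2(0); not specified in the thread

def zscore_per_gene(matrix, pseudocount=PSEUDOCOUNT):
    """Log2-transform, then center and scale each gene across samples."""
    result = {}
    for gene, row in matrix.items():
        logged = [math.log2(v + pseudocount) for v in row]
        mean = statistics.mean(logged)
        sd = statistics.stdev(logged)  # sample SD (n - 1), as R's scale() uses
        # A gene that is constant across samples has SD 0; returning zeros
        # is a choice made for this sketch.
        result[gene] = [(v - mean) / sd if sd > 0 else 0.0 for v in logged]
    return result

z = zscore_per_gene(rpm)
for gene, scores in z.items():
    print(gene, [round(s, 3) for s in scores])
```

Each gene ends up with mean 0 and standard deviation 1 across samples, so the values are not confined to [0, 1].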

Reply • 3.9 years ago by ATpoint

I think that with Z-score normalization the expression ratio between genes will be lost, won't it?

Reply • 3.9 years ago by m.taheri

Yes, that is exactly the point. Each gene can be visualized on the same scale. If you normalize between zero and one, the result will be dominated by highly expressed genes, so it is difficult to see differences between genes that are not highly expressed.

Reply • 3.9 years ago by ATpoint

What if I do Z-score normalization using the mean and standard deviation of expression across all genes and all samples? Does it have any benefit over using gene-specific mean and standard deviation values? Could you please explain the t and scale functions in t(scale(t(log2matrix)))?

Reply • 3.9 years ago by m.taheri

You want to use the gene-specific mean and SD: you have thousands of genes, so a global mean is not representative of every single gene. The t stands for transpose. Since scale operates column-wise by default, we first transpose the matrix so that genes are in columns, and once the scaling is done we transpose it back so that genes are again rows and samples are again columns.
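The transpose, scale, transpose pattern and the contrast with a single global mean and SD can be illustrated with a small Python sketch. The helpers `transpose` and `scale_columns` are hypothetical stand-ins that only mimic the default behaviour of R's t() and scale(), and the matrix values are made up.

```python
import statistics

# Log2 expression, rows = genes, columns = samples (toy values, made up).
log2matrix = [
    [10.0, 11.0, 12.0],  # highly expressed gene
    [1.0, 2.0, 3.0],     # lowly expressed gene
]

def transpose(m):
    """Swap rows and columns, like R's t()."""
    return [list(col) for col in zip(*m)]

def scale_columns(m):
    """Center and scale each column, mimicking R's scale() defaults."""
    scaled_cols = []
    for col in transpose(m):
        mean = statistics.mean(col)
        sd = statistics.stdev(col)  # sample SD (n - 1)
        scaled_cols.append([(v - mean) / sd for v in col])
    return transpose(scaled_cols)

# Gene-specific Z-scores: put genes in columns, scale, then flip back.
per_gene = transpose(scale_columns(transpose(log2matrix)))

# Global Z-scores: one mean and one SD over every value in the matrix.
flat = [v for row in log2matrix for v in row]
g_mean, g_sd = statistics.mean(flat), statistics.stdev(flat)
global_z = [[(v - g_mean) / g_sd for v in row] for row in log2matrix]

print(per_gene)  # both genes show the same low-to-high pattern across samples
print(global_z)  # first gene is all positive, second gene is all negative
```

With the global statistics, the scores mostly reflect how strongly a gene is expressed overall, while the gene-specific version shows how each sample deviates within a gene.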

Reply • 3.9 years ago by ATpoint