Normalize gene expression data between 0 and 1
3.9 years ago
m.taheri ▴ 50

I have a set of RNA-seq gene expression data (normalized by the RPM method) from different samples. I need to rescale the gene expression values to lie between 0 and 1.

Let:

MIN is the minimum gene expression across all genes and all samples

MAX is the maximum gene expression across all genes and all samples

MIN_X is the minimum expression of gene X across all samples

MAX_X is the maximum expression of gene X across all samples

Normalized expression values for gene X can be calculated in one of the following ways:

1: This results in very small gene expression values:

normalized gene expression X = (gene expression X - MIN ) / (MAX - MIN)

2: This results in loss of information about the ratio of expression values between genes:

normalized gene expression X = (gene expression X - MIN_X) / (MAX_X - MIN_X)

Is there a way to normalize gene expression values to the range [0,1] while avoiding the above issues?
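To make the two issues concrete, here is a minimal R sketch of both formulas. The matrix `rpm` and its values are made up for illustration (genes in rows, samples in columns); they are not from the original data.

```r
# Toy RPM matrix (hypothetical values): genes in rows, samples in columns.
rpm <- matrix(
  c(1000, 2000, 4000,
       1,    2,    4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3"))
)

# Option 1: global min-max, one MIN and one MAX for the whole matrix.
global_minmax <- (rpm - min(rpm)) / (max(rpm) - min(rpm))

# Option 2: per-gene min-max, MIN_X and MAX_X taken across samples for each row.
# A length-nrow vector is recycled down the columns, so row i uses row_min[i].
row_min <- apply(rpm, 1, min)
row_max <- apply(rpm, 1, max)
gene_minmax <- (rpm - row_min) / (row_max - row_min)

print(round(global_minmax, 5))  # geneB is squeezed close to 0
print(gene_minmax)              # geneA and geneB become identical rows
```

With option 1 the lowly expressed gene ends up with values below 0.001, and with option 2 both genes get the same profile even though one is 1000 times more abundant. Note that option 2 divides by zero for a gene that is constant across samples, so such genes would need to be filtered or handled separately.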

gene-expression • 2.6k views
Why don't you normalize with a Z-score? That will indicate, for each gene, how much each sample deviates from the mean of all samples for that gene.

Transform your counts to log2 and then do t(scale(t(log2matrix))).
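A minimal sketch of that suggestion is below. The toy matrix `rpm` and the pseudocount of 1 are assumptions added here so the example runs and zeros stay finite after the log; the comment above only specifies the log2 transform and the `t(scale(t(...)))` call.

```r
# Toy RPM matrix (hypothetical values): genes in rows, samples in columns.
rpm <- matrix(
  c(1000, 2000, 4000,
       1,    2,    4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3"))
)

# Assumption: add a pseudocount of 1 before log2 so that zero values stay finite.
log2matrix <- log2(rpm + 1)

# scale() centers and scales each column, so put genes in columns first,
# scale them, then transpose back to genes-in-rows.
zscores <- t(scale(t(log2matrix)))

print(round(zscores, 3))
print(rowMeans(zscores))       # approximately 0 for every gene
print(apply(zscores, 1, sd))   # 1 for every gene
```

The resulting values are not confined to [0,1]; each gene instead has mean 0 and standard deviation 1 across samples.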

I think Z-score normalization will lose the expression ratio between genes, won't it?

Yes, that is exactly the point. Each gene can be visualized on the same scale. If you normalize between zero and one, the scale will be dominated by highly expressed genes, so it is difficult to see differences among genes that are not highly expressed.

What if I do the Z-score normalization using the mean and standard deviation of the expression of all genes in all samples? Does that have any benefit over using gene-specific mean and standard deviation values? Could you please explain the t and scale functions in t(scale(t(log2matrix)))?

You want to center and divide by the gene-specific mean and SD: with thousands of genes, a single global mean is not representative of any individual gene. The t stands for transpose. Since scale operates column-wise by default, we first transpose the matrix so that genes are in columns, and after scaling we transpose it back so that genes are rows and samples are columns again.
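The following sketch spells out what the transpose trick computes and contrasts it with a single global mean and SD. The values in `log2matrix` are made up for illustration.

```r
# Toy log2 expression matrix (hypothetical values): genes in rows, samples in columns.
log2matrix <- matrix(
  c(10, 11, 12,
     1,  2,  3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3"))
)

# Gene-specific Z-score written out by hand: each row minus its own mean,
# divided by its own SD (the length-nrow vectors recycle so row i uses element i).
by_gene <- (log2matrix - rowMeans(log2matrix)) / apply(log2matrix, 1, sd)

# The same computation using scale(), which works on columns, hence the transposes.
via_scale <- t(scale(t(log2matrix)))

# Global Z-score: one mean and one SD taken over every value in the matrix.
global <- (log2matrix - mean(log2matrix)) / sd(log2matrix)

print(by_gene)            # both genes: -1, 0, 1
print(round(global, 3))   # geneA all positive, geneB all negative
```

In this toy case the gene-specific version shows the same sample-to-sample pattern for both genes, while the global version mostly reflects that geneA is more abundant than geneB overall.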
