Question: What is the point of standardizing each gene in gene expression data?
12 months ago, ebrudermanver wrote:

I was advised to standardize each gene (so that each gene has zero mean and unit variance across all samples) before clustering the genes in a gene expression dataset, but I don't understand why. Say one gene has an expression of around 4000 in all cancer patients and another has an expression of around 4 in all cancer patients, and the two expression profiles are highly correlated. After standardizing both genes, the distance between them (which was originally large because of the different scales) becomes very small, so the two genes will very likely end up in the same cluster. Intuitively, though, those genes look quite different to me, and I don't understand why they should be clustered together.
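To make the scenario above concrete, here is a minimal Python sketch with made-up numbers. The 20 samples, the noise levels, and the helper `zscore` are all assumptions for illustration and are not part of the original question. It compares the Euclidean distance between a gene around 4000 and a gene around 4 before and after per-gene standardization, and checks the identity that for z-scored profiles the squared distance equals 2(n-1)(1-r), where r is the Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20  # hypothetical number of samples (patients)

# A shared pattern across samples drives both genes (assumption for illustration).
signal = rng.normal(size=n)
gene_high = 4000 + 200 * signal                             # expression around 4000
gene_low = 4 + 0.2 * signal + 0.02 * rng.normal(size=n)     # expression around 4, small noise


def zscore(x):
    """Standardize one gene: zero mean, unit sample variance across samples."""
    return (x - x.mean()) / x.std(ddof=1)


# Euclidean distance on the raw scale is dominated by the difference in level.
raw_dist = np.linalg.norm(gene_high - gene_low)

# After standardization only the shape of the profile matters.
std_dist = np.linalg.norm(zscore(gene_high) - zscore(gene_low))

# Pearson correlation between the two raw profiles.
corr = np.corrcoef(gene_high, gene_low)[0, 1]

print(f"raw distance:          {raw_dist:.2f}")
print(f"standardized distance: {std_dist:.4f}")
print(f"Pearson r:             {corr:.4f}")
print(f"2(n-1)(1-r):           {2 * (n - 1) * (1 - corr):.4f}")
print(f"standardized dist^2:   {std_dist ** 2:.4f}")
```

In other words, once each gene is standardized, Euclidean distance depends only on the correlation between the two profiles and not on their absolute levels, so clustering groups genes by the shape of their profile across samples.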

An example I read about the need for standardization is that when one variable is in kilograms and another is in grams, the variables are not directly comparable, so standardization is needed. That makes a lot of sense. But in a gene expression dataset, all variables are in the same unit (say, intensity in a microarray dataset), so I cannot relate gene expression data to that example. Can somebody explain why we need standardization for gene expression data in general, and for clustering genes in particular? Thanks.

12 months ago, Sirus (Boston/USA) wrote:

I think it is just for visualization purposes.

Suppose that you use a certain threshold to select a group of genes to display (e.g. RPM > 1). The highly expressed genes will skew the color scale of your heatmap, so some genes that are also up in the same group of samples, but at a lower absolute level, will look as if they are not expressed (they get the same color as the lowly expressed ones). Some people use log2(RPM+1) to reduce this effect, but in that case a 2-fold change shows up as a change of just 1 unit.

A z-score shows the change relative to each gene's own mean and standard deviation, which makes the heatmap (and the clusters) clearer.
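As a rough illustration of the three scalings mentioned in this answer, here is a small Python sketch. The RPM values and the two sample groups are invented for the example, and per-row z-scoring with `ddof=1` is one common convention, not necessarily what any particular heatmap tool does.

```python
import numpy as np

# Rows = genes, columns = samples (first three = group A, last three = group B).
# Hypothetical RPM values: both genes double in group B, at very different levels.
rpm = np.array([
    [2000.0, 2100.0, 1900.0, 4000.0, 4200.0, 3800.0],  # highly expressed gene
    [20.0, 21.0, 19.0, 40.0, 42.0, 38.0],              # lowly expressed gene
])

# On a shared raw color scale, the low gene occupies a tiny fraction of the range.
low_fraction_of_scale = rpm[1].max() / rpm.max()

# log2(RPM + 1): a 2-fold change becomes a difference of roughly 1.
log_rpm = np.log2(rpm + 1)
log_step_high = log_rpm[0, 3] - log_rpm[0, 0]
log_step_low = log_rpm[1, 3] - log_rpm[1, 0]

# Per-gene z-score: subtract each row's mean and divide by its standard deviation.
row_mean = rpm.mean(axis=1, keepdims=True)
row_sd = rpm.std(axis=1, ddof=1, keepdims=True)
row_z = (rpm - row_mean) / row_sd

print("fraction of raw scale used by low gene:", low_fraction_of_scale)
print("log2 step, high gene:", log_step_high)
print("log2 step, low gene: ", log_step_low)
print("row z-scores:\n", np.round(row_z, 3))
```

Because the second gene here is an exact rescaling of the first, the two rows of `row_z` coincide, so a heatmap colored by row z-score would show the same pattern for both genes.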

12 months ago, theobroma22 wrote:

Standardization helps to cluster the data. Say two clusters are similar before standardization but dissimilar after it. This means that standardizing the data made it possible to distinguish these two clusters, rather than erroneously forcing them into one cluster.
