Question

log2Fold Change as input for k-means analysis

8 months ago

concetta ▴ 10

Hi,

I am analyzing an RNA-seq dataset that includes different treatments and time points. I would like to perform a k-means analysis to cluster my genes. I have already performed a differential expression analysis with DESeq2 and selected the significantly up-regulated genes based on log2 fold change and adjusted p-value.

Now I would like to cluster my up-regulated genes with k-means to identify groups of genes that are up-regulated in a specific treatment and/or time point.

I was wondering whether it is correct to use the log2 fold changes as input values for the k-means analysis instead of normalized expression values, such as TPM or log-transformed counts from DESeq2.

Thanks for the help.

Best,

Concetta

gene-analysis RNA-seq k-means • 460 views

ADD COMMENT • link updated 8 months ago by rfran010 ▴ 900 • written 8 months ago by concetta ▴ 10

I'm not great with the math, but I believe I have a working understanding, so I'll try to be helpful.

I think Z-score scaling of each gene across treatments/time points may better serve your goal of clustering genes that behave similarly.

The problem with log2FC is that it is biased by gene length and expression level: long genes or genes with high counts are likely to have lower fold changes. So your clusters may be driven by the baseline expression or nature of the gene. For example, a low-expressed gene that goes up 10-fold at two time points and a high-expressed gene that goes up 2-fold at the same two time points have the same behavior, but may be clustered separately.
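To make the 10-fold vs 2-fold example concrete, here is a minimal sketch in plain Python. The counts and the helper names (`log2fc_vs_baseline`, `euclidean`) are made up for illustration and are not DESeq2 output. It computes each gene's log2FC relative to a baseline sample and then the Euclidean distance between the two genes, which is the distance k-means works with.

```python
import math

# Hypothetical counts at baseline, time point 1, time point 2 (illustrative only)
low_gene = [10, 100, 100]        # low expression, goes up 10-fold
high_gene = [1000, 2000, 2000]   # high expression, goes up 2-fold


def log2fc_vs_baseline(counts):
    """log2 fold change of each later sample relative to the first (baseline)."""
    base = counts[0]
    return [math.log2(c / base) for c in counts[1:]]


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


low_fc = log2fc_vs_baseline(low_gene)
high_fc = log2fc_vs_baseline(high_gene)
distance = euclidean(low_fc, high_fc)

print("low gene log2FC: ", low_fc)
print("high gene log2FC:", high_fc)
print("distance between them:", distance)
```

Both genes go up at the same two time points, yet their log2FC profiles sit more than 3 units apart, so k-means has no particular reason to put them in the same cluster.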

I think TPM can reduce the gene-length bias, but the clusters may then separate genes by low vs. high expression instead of by changes between conditions/time points.

Normalized counts would be the worst here, since they suffer the most from expression/gene-length bias within each treatment, so the clusters will not be meaningful. This is usually very apparent once the clusters are made.
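A rough illustration of why expression level takes over, again with invented numbers and hypothetical gene names: on the count scale, a highly expressed gene that goes up ends up closer to a highly expressed gene that goes down than to a low-expressed gene with exactly the same 2-fold pattern.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical normalized counts at baseline, time point 1, time point 2
low_up = [10, 20, 20]            # low expression, 2-fold up
high_up = [1000, 2000, 2000]     # high expression, 2-fold up (same pattern)
high_down = [2000, 1000, 1000]   # high expression, 2-fold down (opposite pattern)

d_same_pattern = euclidean(low_up, high_up)
d_opposite_pattern = euclidean(high_up, high_down)

print("same pattern, different expression level:", d_same_pattern)
print("opposite pattern, same expression level: ", d_opposite_pattern)
```

Here the two genes with the same behavior are about 2970 apart, while the two genes with opposite behavior are only about 1732 apart, so the clusters would mostly reflect expression level.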

With z-scores, you normalize each gene's mean and variance, so that highly expressed genes are put on the same scale as lowly expressed genes, and changes between conditions are not dominated by high-variance genes. So, if we take the earlier example of the 10-fold and 2-fold changes, both genes could end up 1 standard deviation above their mean (z-score = 1) at those time points and would be expected to cluster together.
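A small sketch of the per-gene z-score, under a few assumptions of mine: the counts are invented, the z-score is taken on log2 values, and the population standard deviation is used (R's `scale()` uses the sample standard deviation, which only changes the scaling constant). `zscore_row` is a hypothetical helper, not a library function.

```python
import math
import statistics


def zscore_row(values):
    """Scale one gene's profile to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A flat gene carries no pattern; return zeros rather than divide by zero
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical counts at baseline, time point 1, time point 2 (illustrative only)
low_gene = [10, 100, 100]        # low expression, goes up 10-fold
high_gene = [1000, 2000, 2000]   # high expression, goes up 2-fold

low_z = zscore_row([math.log2(c) for c in low_gene])
high_z = zscore_row([math.log2(c) for c in high_gene])

print("low gene z-scores: ", low_z)
print("high gene z-scores:", high_z)
```

After scaling, the two profiles are the same up to floating-point error (roughly -1.41, 0.71, 0.71), so the distance between them is essentially zero and they would land in the same cluster.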

However, if log2FC values are really what matter to you, then I suppose it could make sense to use those.
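Whichever input you choose, the clustering step itself is the same: one row per gene, one column per condition/time point. Below is a bare-bones sketch of k-means (Lloyd's algorithm) in plain Python, only to show the mechanics. The gene names and values are invented, the farthest-first initialisation is my own choice to keep the toy example deterministic, and in practice you would more likely use an existing implementation such as `kmeans` in R or `KMeans` in scikit-learn.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100):
    """Plain Lloyd's algorithm with a deterministic farthest-first start."""
    # Start from the first row, then repeatedly add the row that is
    # farthest from the centroids chosen so far.
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(rows, key=lambda r: min(dist2(r, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each gene goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: dist2(row, centroids[j])) for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical z-scored profiles across three conditions (illustrative only)
profiles = {
    "geneA": [-1.41, 0.71, 0.71],   # up in conditions 2 and 3
    "geneB": [-1.40, 0.70, 0.70],
    "geneC": [0.71, 0.71, -1.41],   # up in conditions 1 and 2
    "geneD": [0.70, 0.72, -1.42],
}

rows = list(profiles.values())
labels, centroids = kmeans(rows, k=2)
for gene, label in zip(profiles, labels):
    print(gene, "-> cluster", label)
```

With these toy values, geneA and geneB end up in one cluster and geneC and geneD in the other. Swapping the z-scores for log2FC values only changes the numbers in `profiles`, not the procedure.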

ADD REPLY • link 8 months ago by rfran010 ▴ 900