Question

Advice on Box-Cox transformation (powerTransform function) before UMAP clustering process statistics

0

20 months ago

jscl1n22 • 0

Hi guys,

I am currently analysing some gene expression data. The dataset has been analysed in several earlier studies. I have identified one particular study that used standard k-means clustering to identify different phenotypes.

My main goal is to perform UMAP clustering on the data to explore other phenotypes. Before that step, however, the authors used the powerTransform function during pre-processing to bring the data closer to a normal distribution. I now have to do the same, but I am struggling with this step.

I have tried running powerTransform(expression values ~ different clinical variables) and got some results. The clinical variables include both numeric and character data.

Am I doing the right thing here, or is there a step I'm missing? I read that I need to find out what lambda is before anything else, but I'm not sure. It would be lovely to hear your thoughts!

Thanks!

Transformation UMAP • 1.5k views

ADD COMMENT • link updated 20 months ago by LChart 4.3k • written 20 months ago by jscl1n22 • 0

1

If you are referencing another paper, please link it. There are also some missing items which would be helpful to know:

1) What are you trying to cluster? Patients? Tumor RNA-seq? Single cell data?

2) What features are you trying to use to cluster?

3) When you say "UMAP clustering", what precisely do you mean? UMAP is primarily a data visualization tool for embedding data into 2D space; it is not itself a clustering procedure.

ADD REPLY • link 20 months ago by LChart 4.3k

0

I'm trying to cluster the patients' gene expression data. The patients also have several clinical features, and I want to include as many of them as possible.

Come to think of it, do you think I should make a scree plot to determine which variables to include, and how many? Should I perform a PCA?

ADD REPLY • link 20 months ago by jscl1n22 • 0

1

Gene expression data is typically normalized to CPM or TPM and then log-transformed. TMM and other normalization methods are also applied.

If you are trying to include both gene expression and clinical information in the same clustering, you will almost certainly find that with O(10,000) gene expression features and O(10) clinical features, the clinical information has very little impact. It would therefore make sense to extract a small number of features (via PCA or another method) from the GEx data, so that the GEx features do not dominate the clustering.
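To make the imbalance concrete, here is a minimal sketch with simulated data. The matrices, their sizes, and the assumption that every feature is already on a unit-variance scale are all hypothetical and chosen only for illustration; the point is how little of the squared Euclidean distance between two patients comes from the clinical block.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients = 50

# Hypothetical, already standardized feature blocks (unit variance per feature).
genes = rng.normal(size=(n_patients, 10_000))
clinical = rng.normal(size=(n_patients, 10))

def mean_sq_dist(block):
    """Mean squared Euclidean distance between consecutive patients."""
    diff = block[1:] - block[:-1]
    return float((diff**2).sum(axis=1).mean())

gene_part = mean_sq_dist(genes)
clinical_part = mean_sq_dist(clinical)
share = clinical_part / (gene_part + clinical_part)

print(f"share of squared distance from clinical features: {share:.4%}")
```

With 10 clinical features against 10,000 gene features on the same scale, the clinical share is on the order of 0.1%, which is why reducing the GEx block to a handful of components first makes the two blocks comparable.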

As I mention below, UMAP is based on a neighbor graph, so it would be possible to scale features relative to one another so that distances are more sensitive to clinical than GEx features; but a first-pass approach might be:

1) Normalize RNA, log1p transform, PCA
2) Box-Cox or bin clinical variables; normalize
3) Compute distances on PCA + normalized clinical variables*
4) UMAP

*The distance computation / neighbor graph may be built in to whatever package you're using for UMAP
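The first-pass steps above could be sketched roughly as follows, using only NumPy. Everything here is an assumption made for illustration: the simulated counts and clinical values, the choice of 10 components, a plain log standing in for a fitted Box-Cox transform, and the decision to z-score the principal components so they sit on the same scale as the clinical variables. The resulting matrix is what would then be handed to whichever UMAP implementation you use.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_genes, n_clinical = 60, 2000, 5

# Hypothetical inputs: raw counts (patients x genes) and positive clinical values.
counts = rng.poisson(lam=5.0, size=(n_patients, n_genes)).astype(float)
clinical = rng.lognormal(size=(n_patients, n_clinical))

# 1) Normalize to counts per million, then log1p.
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
log_expr = np.log1p(cpm)

# 2) PCA via SVD of the centered matrix; keep a small number of components.
n_components = 10
centered = log_expr - log_expr.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pcs = u[:, :n_components] * s[:n_components]

def zscore(m):
    """Center each column and scale it to unit standard deviation."""
    return (m - m.mean(axis=0)) / m.std(axis=0)

# 3) Transform the clinical variables (log as a stand-in for Box-Cox), then z-score.
clinical_z = zscore(np.log(clinical))
pcs_z = zscore(pcs)

# 4) Combined matrix on which distances / the neighbor graph are computed.
features = np.hstack([pcs_z, clinical_z])
print(features.shape)  # (60, 15)
```

Z-scoring both blocks gives each principal component and each clinical variable equal weight in a Euclidean distance; multiplying one block by a constant is a simple way to make distances more or less sensitive to it.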

ADD REPLY • link 20 months ago by LChart 4.3k

0

LChart thank you for the suggestions!

Would you mind pointing me in the right direction on calculating the neighbour graph before performing UMAP?

Also, my data frame currently contains NA values, and UMAP cannot process them. I thought of replacing them with zero, but an NA is not a zero. An NA means the value was not detected for some reason; it does not mean there was no expression.

What do you think I should do?

ADD REPLY • link 20 months ago by jscl1n22 • 0

1

"Would you mind pointing me in the right direction on calculating the neighbour graph before performing UMAP?"

It's either built into the umap function, or it is another function within the same package (such as knn or knndescent or something like that). The representation of the graph will matter for the next step, so mixing and matching libraries may not be a good idea.
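For intuition about what such a function produces, here is a brute-force k-nearest-neighbour sketch in NumPy. The function name `knn_graph`, the random feature matrix, and k=15 are hypothetical; UMAP packages normally build this graph internally (often with approximate methods), so this is only meant to show the shape of the result, not to replace the package's own routine.

```python
import numpy as np

def knn_graph(features, k):
    """Return (indices, distances) of the k nearest neighbours of each row,
    using brute-force Euclidean distances (fine for small sample counts)."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    idx = np.argsort(dist, axis=1)[:, :k]
    return idx, np.take_along_axis(dist, idx, axis=1)

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 15))  # hypothetical patients x features matrix

idx, dist = knn_graph(features, k=15)
print(idx.shape, dist.shape)  # (100, 15) (100, 15)
```

Each row of `idx` lists the neighbours of one patient, nearest first, and `dist` holds the matching distances.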

You will need to impute missing data prior to computing distances; or drop records with NA.
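The two options could look like this in NumPy. The small matrix is made up, and column-median imputation is just one simple choice picked for illustration; whether it is appropriate depends on why the values are missing.

```python
import numpy as np

# Hypothetical patients x features matrix with missing values.
data = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 6.0],
    [3.0, 4.0, 9.0],
    [4.0, 6.0, 12.0],
])

# Option 1: drop every record (row) that contains at least one NA.
complete = data[~np.isnan(data).any(axis=1)]

# Option 2: replace each NA with the median of the observed values in its column.
col_median = np.nanmedian(data, axis=0)
imputed = np.where(np.isnan(data), col_median, data)

print(complete.shape)  # (2, 3)
print(imputed)
```

Dropping records is the safer option when few rows are affected; imputation keeps all patients but adds values that were never measured.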

ADD REPLY • link 20 months ago by LChart 4.3k

score 1 · Answer 1 · 2023-03-06

1

20 months ago

Mensur Dlakic ★ 28k

This is the first time I have heard of Box-Cox (or any power transformation) being applied before UMAP. Since UMAP is a non-linear dimensionality reduction method, it should be able to approximate power transformations without the need to pre-process the data. That said, I can't say that I foresee an obvious error in using Box-Cox before UMAP, although it will probably squeeze the data into a tighter range. If that is desired, it could work.

Box-Cox is a simple numerical transformation that works only with positive values and has lambda as its transformation parameter. So yes, lambda needs to be estimated before the power transformation is applied.
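As a rough illustration of estimating lambda and then applying the transformation, here is a Python sketch using `scipy.stats.boxcox` as a stand-in for the R powerTransform function mentioned in the question. The simulated variable and the helper `box_cox` are hypothetical and only there to show the formula.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed, strictly positive variable (Box-Cox requires x > 0).
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# With no lambda supplied, scipy estimates it by maximum likelihood and
# returns both the transformed values and the fitted lambda.
x_bc, lam = stats.boxcox(x)

def box_cox(values, lam):
    """Box-Cox transform: (x**lam - 1) / lam, or log(x) when lam == 0."""
    values = np.asarray(values, dtype=float)
    if lam == 0:
        return np.log(values)
    return (values**lam - 1.0) / lam

print(f"estimated lambda: {lam:.3f}")
print(f"skewness before: {stats.skew(x):.2f}, after: {stats.skew(x_bc):.2f}")
```

Applying `box_cox(x, lam)` by hand reproduces `x_bc`, which shows that the only thing that has to be learned from the data is lambda itself.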

ADD COMMENT • link 20 months ago by Mensur Dlakic ★ 28k

1

UMAP is based on neighbor embeddings, so the neighbor graph needs to be computed before it is applied. Normalizing transformations can be applied to alter pairwise distances, e.g. so that extreme differences in one (long-tailed) dimension do not outweigh equality in other dimensions. In fact, for single-cell data, CPMs are almost always log-transformed prior to PCA and neighbor calculation. For other kinds of data it may make sense to apply a more general transformation.
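A tiny made-up example shows how a log transform can change which sample counts as the nearest neighbour. The three samples and their values are hypothetical: feature 0 is long-tailed (like the CPM of a highly expressed gene) and feature 1 has a small range.

```python
import numpy as np

x = np.array([
    [10000.0, 1.0],   # a
    [12000.0, 1.0],   # b: equal to a in feature 1, modest relative change in feature 0
    [10050.0, 50.0],  # c: close to a in feature 0, 50-fold different in feature 1
])

def dist(m, i, j):
    """Euclidean distance between rows i and j of m."""
    return float(np.linalg.norm(m[i] - m[j]))

raw_ab, raw_ac = dist(x, 0, 1), dist(x, 0, 2)

logx = np.log1p(x)
log_ab, log_ac = dist(logx, 0, 1), dist(logx, 0, 2)

print(f"raw: d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"log: d(a,b)={log_ab:.3f}  d(a,c)={log_ac:.3f}")
```

On the raw scale the long-tailed feature dominates and c is the nearest neighbour of a; after log1p the distance reflects fold changes and b becomes the nearest neighbour, so the neighbor graph, and hence the UMAP embedding, can differ.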