Hi guys,
I am currently analysing some gene expression data. The dataset has been analysed in several previous studies. I have identified one particular study that used standard k-means clustering to identify different phenotypes.
My main goal is to perform UMAP clustering on the data to explore other phenotypes. Before clustering, however, the authors used a powerTransformation function in pre-processing to bring the data closer to a normal distribution. I need to do the same, but I am struggling with this step.
I have tried running powerTransform(expression values ~ different clinical variables) and got some results. The clinical variables include both numeric and character data.
Am I doing the right thing here, or is there a step I'm missing? I read that I need to find out what lambda is before anything else, but I'm not sure. It would be lovely to hear your thoughts!
Thanks!
If you are referencing another paper, please link it. There are also some missing items which would be helpful to know:
1) What are you trying to cluster? Patients? Tumor RNA-seq? Single cell data?
2) What features are you trying to use to cluster?
3) When you say "UMAP clustering", what precisely do you mean? UMAP is primarily a data visualization tool that embeds data into a low-dimensional (usually 2D) space; it is not itself a clustering procedure.
I'm trying to cluster the patients' gene expression data. The dataset also has several clinical features, and I want to include as many of them as possible.
Come to think of it, do you think I should make a scree plot to determine which variables to include, and how many? Or perform a PCA?
Gene expression data is typically normalized to CPM or TPM and then log-transformed. TMM and other normalization methods are also used.
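As a minimal sketch of the CPM-then-log step, assuming raw read counts with one row per sample and one column per gene: the function name `cpm_log1p` and the toy counts are made up for illustration, and in practice you would more likely use an established package for this.

```python
import math

def cpm_log1p(counts):
    """Convert raw counts to log1p(CPM).

    counts: list of samples, each a list of raw read counts per gene.
    """
    normalized = []
    for sample in counts:
        library_size = sum(sample)
        if library_size == 0:
            raise ValueError("sample has no reads; cannot compute CPM")
        # Counts per million: scale each gene by the sample's total reads.
        # log1p(x) = log(1 + x) keeps zero counts at zero.
        normalized.append([math.log1p(c / library_size * 1e6) for c in sample])
    return normalized

# Toy data (hypothetical): 2 samples x 3 genes.
counts = [
    [10, 0, 90],
    [200, 100, 700],
]
logged = cpm_log1p(counts)
```

Note that the two samples have very different sequencing depths (100 vs 1,000 reads), which is exactly what the per-sample scaling is meant to remove before any distances are computed.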
If you are trying to include both gene expression and clinical information in the same clustering, you will almost certainly find that, with O(10,000) gene expression features and O(10) clinical features, the clinical information has very little impact. It would therefore make sense to extract a small number of features (via PCA or another method) from the GEx data, so that the expression features do not dominate the clustering.
As I mention below, UMAP is based on a neighbor graph, so it would be possible to scale features relative to one another so that distances are more sensitive to clinical than GEx features; but a first-pass approach might be:
1) Normalize RNA, log1p transform, PCA
2) Box-Cox or bin clinical variables; normalize
3) Compute distances on PCA + normalized clinical variables*
4) UMAP
*The distance computation / neighbor graph may be built into whatever package you're using for UMAP.
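On the lambda question: the Box-Cox transform is y = (x^lambda - 1) / lambda, with y = log(x) when lambda = 0, and lambda is normally estimated from the data by maximum likelihood. Here is a rough sketch of that for a single numeric clinical variable, using a plain grid search over the profile log-likelihood. The grid search is my own simplification, not necessarily what any particular package does internally, and the function names and example values are hypothetical. Box-Cox only applies to strictly positive numeric values, so character variables need a different encoding (binning or one-hot, for example).

```python
import math

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox needs strictly positive values")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]  # limiting case lambda = 0
    return [(v ** lam - 1.0) / lam for v in values]

def profile_log_likelihood(values, lam):
    """Box-Cox profile log-likelihood (up to an additive constant)."""
    n = len(values)
    y = box_cox(values, lam)
    mean = sum(y) / n
    variance = sum((v - mean) ** 2 for v in y) / n  # fails if all values are equal
    jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(variance) + jacobian

def estimate_lambda(values, lo=-2.0, hi=2.0, steps=401):
    """Pick the lambda on an evenly spaced grid that maximizes the likelihood."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))

def zscore(values):
    """Center to mean 0 and scale to unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("constant variable cannot be z-scored")
    return [(v - mean) / sd for v in values]

# Hypothetical right-skewed clinical measurement (made-up numbers).
marker = [1.2, 0.8, 3.5, 2.1, 15.0, 0.9, 6.4, 1.7]
lam = estimate_lambda(marker)
marker_normalized = zscore(box_cox(marker, lam))
```

The z-scoring at the end is the "normalize" part: it puts the transformed clinical variable on a scale comparable to the other features before distances are computed.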
LChart, thank you for the suggestions!
Would you mind pointing me in the right direction for calculating the neighbour graph before running UMAP?
Also, my data frame currently contains NA values, and UMAP cannot process them. I thought of replacing them with zero, but an NA is not a zero. An NA means the value was not detected for some reason; it doesn't mean there was no expression.
What do you think I should do?
It's either built into the umap function, or it is another function within the same package (such as knn or knndescent or something like that). The representation of the graph will matter for the next step, so mixing and matching libraries may not be a good idea.
You will need to impute missing data prior to computing distances, or drop records with NA.