Pre-processing RNAseq data before PCA or tSNE: Log-transform data?
19 months ago
psm ▴ 100

I have a clinical bulk RNAseq dataset with 3 different conditions/groups. I notice that if I use a standard workflow of scaling/centering the data before dimensionality reduction (PCA and tSNE), I get a messy plot of the patients. But when I log-transform the data first, the groups become distinct and tight.

Is this an artifact of my data? Is it generally better to log-transform raw count data prior to scaling for dimensionality reduction?

RNA-Seq • 1.1k views
19 months ago
ATpoint 65k

The typical workflow is to first normalize the data for sequencing depth and library composition, e.g. with edgeR or DESeq2, then log-transform, and then run PCA and/or any other dimensionality reduction technique. Log-transforming raw counts is not meaningful because they still carry the sequencing depth bias. You can also use the dedicated transformations in DESeq2, the vst and rlog functions, which perform the normalization and a log-like variance-stabilizing transformation in one go.
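
To make the order of operations concrete, here is a minimal Python sketch of "normalize, then log, then PCA". It is only an illustration under stated assumptions: the count matrix is simulated toy data, and simple counts-per-million scaling stands in for the edgeR/DESeq2 normalization, which it does not reproduce. The helper names log_cpm and pca_scores are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 6 samples x 200 genes, sequenced at different depths.
n_samples, n_genes = 6, 200
base_expr = rng.lognormal(mean=3.0, sigma=1.5, size=n_genes)
depth = np.array([0.5, 1.0, 2.0, 0.5, 1.0, 2.0])
counts = rng.poisson(base_expr * depth[:, None])


def log_cpm(counts, pseudocount=1.0):
    """Scale each sample to counts per million, then log2-transform.

    CPM is a simplified stand-in for depth/composition normalization;
    it is NOT the edgeR or DESeq2 method.
    """
    lib_size = counts.sum(axis=1, keepdims=True)
    cpm = counts / lib_size * 1e6
    return np.log2(cpm + pseudocount)


def pca_scores(x, n_components=2):
    """PCA via SVD of the gene-centered matrix (samples in rows)."""
    centered = x - x.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return u[:, :n_components] * s[:n_components], explained[:n_components]


logged = log_cpm(counts)
scores, explained = pca_scores(logged)
print("PCA scores shape:", scores.shape)
print("variance explained by PC1, PC2:", explained)
```

With real data you would replace the simulated counts with your own matrix and, ideally, the CPM step with normalized values exported from edgeR or DESeq2.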

Thank you for those points. Sorry, I should have clarified: I did use DESeq to normalize the data first. But if I understand you correctly, even after normalizing the data, it is pretty standard to log-transform it prior to PCA/dimensionality reduction. Appreciate the clarification!

19 months ago
Mensur Dlakic ★ 20k

I don't know whether this is exactly the case with your data, but it sounds like you have a skewed data distribution, which is common for RNAseq data. In such a case a power transformation can bring the data closer to a normal distribution. See here and here about Box-Cox transformation that can be used for this purpose.

Many people use log-transformation without understanding why it works. It so happens that when the Box-Cox parameter lambda is zero, the appropriate power transformation is log(X). You can check whether that's the case for your data if you have python and scipy installed. Even if the lambda calculated for your data is not exactly zero but close enough, log-transformation will work.
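
A rough sketch of that check with scipy follows. The Box-Cox transform is (x^lambda - 1)/lambda for nonzero lambda and log(x) for lambda = 0. The values below are simulated log-normal data used as a hypothetical stand-in for normalized expression values, so swap in your own. Note that Box-Cox requires strictly positive input, so zeros have to be dropped or offset first.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for the normalized expression values of one sample.
# Log-normal data is used here, so the fitted lambda should land near zero.
values = rng.lognormal(mean=2.0, sigma=1.0, size=5000)

# Box-Cox is only defined for strictly positive values.
positive = values[values > 0]

# With lmbda left at its default (None), boxcox returns the transformed data
# together with the maximum-likelihood estimate of lambda.
transformed, fitted_lambda = stats.boxcox(positive)

print(f"estimated lambda: {fitted_lambda:.3f}")
print(f"skewness before:  {stats.skew(positive):.2f}")
print(f"skewness after:   {stats.skew(transformed):.2f}")
```

If the estimated lambda for your data is close to zero, that supports using a plain log-transformation. A value far from zero would suggest a different power transformation.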

I appreciate the in-depth answer! I will definitely look into those links. Cheers