Question

Pre-processing RNAseq data before PCA or tSNE: Log-transform data?

3.2 years ago

psm ▴ 130

I have a clinical bulk RNAseq dataset with 3 different conditions/groups. I notice that if I use a standard workflow of scaling/centering the data before dimensionality reduction (PCA and tSNE), I get a messy plot of the patients. But when I log-transform the data first, the groups become distinct and tight.

Is this an artifact of my data? Is it generally better to log-transform count data prior to scaling for dimensionality reduction?

RNA-Seq • 2.3k views

updated 3.2 years ago by ATpoint 82k • written 3.2 years ago by psm ▴ 130

score 2 · Answer 1 · 2021-02-11

3.2 years ago

ATpoint 82k

The typical process is first to normalize the data for sequencing depth and composition, e.g. with edgeR or DESeq2, then to log-transform it, and only then to run PCA and/or any other reduction technique. Log-transforming raw counts is not meaningful because they still carry the sequencing-depth bias. You can also use dedicated variance-stabilizing transformations, which handle the normalization and a log-like transformation in one go, e.g. the vst or rlog functions from DESeq2.
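To make the order of operations concrete, here is a minimal Python sketch of "normalize, then log, then PCA". It is not the edgeR or DESeq2 implementation: the `size_factors` helper is a simplified median-of-ratios calculation written for illustration, `log_pca` is a hypothetical helper name, the pseudocount of 1 is an assumption, and the count matrix is simulated toy data (genes in rows, samples in columns).

```python
import numpy as np

def size_factors(counts):
    """Simplified median-of-ratios size factors for a genes x samples matrix."""
    counts = np.asarray(counts, dtype=float)
    # Keep only genes with a nonzero count in every sample, so logs are finite.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # Per-gene log geometric mean across samples acts as a pseudo-reference.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Per-sample median log ratio to the reference, back on the linear scale.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

def log_pca(counts, n_components=2, pseudocount=1.0):
    """Depth-normalize, log2-transform, center genes, then PCA via SVD."""
    counts = np.asarray(counts, dtype=float)
    normalized = counts / size_factors(counts)
    logged = np.log2(normalized + pseudocount)
    x = logged.T                 # samples in rows, genes in columns
    x = x - x.mean(axis=0)       # center each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Toy data: 200 genes, 6 samples sequenced at different depths (assumed values).
rng = np.random.default_rng(0)
gene_means = rng.gamma(shape=2.0, scale=50.0, size=(200, 1))
depth = np.array([1.0, 2.0, 0.5, 1.5, 1.0, 3.0])
counts = rng.poisson(gene_means * depth)

sf = size_factors(counts)
scores, explained = log_pca(counts)
print("size factors:", np.round(sf, 2))
print("PC scores shape:", scores.shape)
```

The estimated size factors should track the simulated depths up to a common scale. On real data, the dedicated functions from the packages named above are the safer choice.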

Thank you for those points. Sorry, I should have clarified: I did use DESeq to normalize the data first. But if I understand you correctly, even after normalizing the data, it is pretty standard to log-transform it prior to PCA/dimensionality reduction. Appreciate the clarification!

Reply • 3.2 years ago by psm ▴ 130

score 1 · Answer 2 · 2021-02-11

I don't know whether this is exactly the case with your data, but it sounds like you have a skewed data distribution, which is common for RNAseq data. In such a case a power transformation can bring the data closer to a normal distribution. See here and here about the Box-Cox transformation, which can be used for this purpose.

Many people use log-transformation without understanding why it works. It so happens that when the Box-Cox parameter lambda is zero, the appropriate power transformation is log(X). You can check whether that's the case for your data if you have Python and SciPy installed. Even if the lambda calculated for your data is not exactly zero but close enough, log-transformation will work.
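A minimal sketch of that check, assuming SciPy and NumPy are available: `scipy.stats.boxcox` estimates lambda by maximum likelihood when no lambda is passed. The values below are simulated log-normal data standing in for normalized expression values, not real RNAseq data. Box-Cox requires strictly positive input, so real data containing zeros would need a pseudocount or filtering first.

```python
import numpy as np
from scipy import stats

# Simulated right-skewed, strictly positive values (stand-in for real data).
rng = np.random.default_rng(0)
values = rng.lognormal(mean=3.0, sigma=1.0, size=5000)

# With lmbda left unset, boxcox returns the transformed data and the
# maximum-likelihood estimate of lambda.
transformed, lam = stats.boxcox(values)
print(f"estimated lambda: {lam:.3f}")

# Compare skewness before and after a plain log transform.
skew_raw = stats.skew(values)
skew_log = stats.skew(np.log(values))
print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

An estimated lambda near zero, together with much lower skewness after taking logs, suggests that a plain log-transformation is a reasonable choice for the data.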