Pre-processing RNAseq data before PCA or tSNE: Log-transform data?
3.2 years ago
psm ▴ 130

I have a clinical bulk RNAseq dataset with 3 different conditions/groups. I notice that if I use a standard workflow of scaling/centering the data before dimensionality reduction (PCA and tSNE), I get a messy plot of the patients. But when I log-transform the data first, the groups become distinct and tight.

Is this an artifact of my data? Is it generally better to log-transform raw count data prior to scaling for dimensionality reduction?

RNA-Seq • 2.3k views
3.2 years ago
ATpoint 82k

The typical workflow is to first normalize the data for sequencing depth and library composition, e.g. with edgeR or DESeq2, then log-transform, and then run PCA and/or any other dimensionality reduction technique. Log-transforming raw counts is not meaningful because they still carry the sequencing depth bias. Alternatively, you can use a dedicated variance-stabilizing transformation that does the normalization and a log-like transformation in one go, e.g. the vst or rlog functions from DESeq2.
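To make the order of operations concrete, here is a minimal NumPy sketch of normalize, then log, then PCA. It is not the DESeq2 or edgeR API: `size_factors` is a hand-written approximation of the median-of-ratios idea, the function names and the pseudocount of 1 are assumptions made for illustration, and the count matrix is simulated toy data.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios style size factors; counts is genes x samples."""
    counts = np.asarray(counts, dtype=float)
    # Keep only genes with non-zero counts in every sample so the log is finite.
    expressed = (counts > 0).all(axis=1)
    log_counts = np.log(counts[expressed])
    # Per-gene mean of log counts = log geometric mean, used as a pseudo-reference.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor of a sample = median ratio of its counts to the reference.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

def log_normalize(counts, pseudocount=1.0):
    """Divide by size factors, then log2-transform (pseudocount is an assumption)."""
    counts = np.asarray(counts, dtype=float)
    return np.log2(counts / size_factors(counts) + pseudocount)

def pca_scores(log_expr, n_components=2):
    """PCA via SVD with samples as rows; each gene is centered first."""
    x = log_expr.T
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data (simulated): 200 genes x 6 samples, last 3 sequenced 4x deeper.
rng = np.random.default_rng(0)
base = rng.gamma(shape=2.0, scale=50.0, size=(200, 1))
depth = np.array([1, 1, 1, 4, 4, 4], dtype=float)
counts = rng.poisson(base * depth)

sf = size_factors(counts)
scores = pca_scores(log_normalize(counts))
```

In this toy example the only difference between the two sets of samples is sequencing depth, so the estimated size factors of the deeper samples come out roughly four times larger, and dividing by them removes that difference before the log and PCA steps. For real data, the vst or rlog functions mentioned above are the safer choice.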


Thank you for those points. Sorry, I should have clarified: I did use DESeq to normalize the data first. But if I understand you correctly, even after normalizing the data, it is pretty standard to log-transform it prior to PCA/dimensionality reduction. Appreciate the clarification!

3.2 years ago
Mensur Dlakic ★ 27k

I don't know whether this is exactly the case with your data, but it sounds like you have a skewed data distribution, which is common for RNAseq data. In such a case a power transformation can bring the data closer to a normal distribution. See here and here about the Box-Cox transformation, which can be used for this purpose.

Many people use log-transformation without understanding why it works. It so happens that when the Box-Cox parameter lambda is zero, the appropriate power transformation is log(X). You can check whether that's the case for your data if you have python and scipy installed. Even if the lambda calculated for your data is not exactly zero but close enough, log-transformation will work.
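A minimal sketch of that check, assuming `scipy.stats.boxcox` is used to estimate lambda by maximum likelihood. The input here is simulated log-normal data standing in for normalized expression values. Box-Cox requires strictly positive 1-D input, so real counts containing zeros would need a pseudocount or filtering first (an assumption about your data, not something scipy does for you).

```python
import numpy as np
from scipy import stats

# Simulated, strictly positive, right-skewed values (stand-in for real data).
rng = np.random.default_rng(1)
values = rng.lognormal(mean=3.0, sigma=1.0, size=5000)

# With lmbda left unset, boxcox returns the transformed data and the
# lambda that maximizes the log-likelihood.
transformed, lam = stats.boxcox(values)

# Skewness before and after a plain log transform, for comparison.
skew_raw = stats.skew(values)
skew_log = stats.skew(np.log(values))

print(f"estimated lambda: {lam:.3f}")
print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

If the estimated lambda is close to zero, as it should be for this simulated input, a plain log transform is a reasonable choice. A lambda far from zero would suggest trying the Box-Cox transform itself.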


I appreciate the in-depth answer! I will definitely look into those links. Cheers
