Question

How can I determine whether any samples are outliers in PCA? What are the criteria?

11 months ago

Yijing ▴ 10

Hello everyone,

I use the following code to plot PCA:

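library(SummarizedExperiment)
library(ggplot2)
# Keep only genes whose TPM varies across samples (constant genes cannot be scaled)
abundance <- assays(se)$abundance
tpm <- abundance[apply(abundance, MARGIN = 1, FUN = function(x) sd(x) != 0), ]
logtpm <- log2(tpm + 1)
# Centre each gene, then transpose so that samples are rows, as prcomp() expects
tpm_centered <- t(logtpm - rowMeans(logtpm))
pca <- prcomp(tpm_centered, center = TRUE, scale. = TRUE)
pca_df <- data.frame(pca$x, colData(se))
ggplot(pca_df, aes(x = PC1, y = PC2, color = TissueArea)) +
    geom_point() +
    labs(x = "PC1", y = "PC2", color = "TissueArea")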

However, I do not know how to label the outliers, nor what the criteria for calling a sample an outlier are. Do you have any experience or a tutorial to share? Many thanks!

PCA • 657 views

updated 11 months ago by Jean-Karim Heriche 27k • written 11 months ago by Yijing ▴ 10

score 5 · Answer 1 · 2023-05-15

A classical multivariate outlier detection method uses the Mahalanobis distance: points with a large Mahalanobis distance from the sample mean are considered outliers. In this spirit, you can classify as an outlier any point that lies farther than a threshold distance t from the centre of its class. The threshold t is chosen such that p% of the points are expected to be rejected. For approximately multivariate normal data, the squared Mahalanobis distance approximately follows a Chi-squared distribution, so the squared threshold t² can be read from a Chi-squared table, where the number of degrees of freedom is the number of variables and alpha = p/100 represents the expected fraction of outliers. To apply this to your data, note that the Euclidean distance in PCA space, once each principal component has been scaled to unit variance, is equivalent to the Mahalanobis distance in the original space. If you keep only the first k components, the distance is computed in that k-dimensional subspace and the number of degrees of freedom is k.
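
The recipe above can be sketched in base R as follows. This is only an illustration under stated assumptions: the expression matrix is simulated rather than taken from your `se` object, and the number of retained components `k`, the level `alpha`, and the planted aberrant sample `S1` are arbitrary choices made for the example. The Chi-squared cutoff also assumes the scores are roughly normally distributed, which may not hold for real data.

```r
# Simulated stand-in for a log-transformed expression matrix (samples in rows).
set.seed(1)
n_samples <- 40
n_genes <- 200
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
               dimnames = list(paste0("S", seq_len(n_samples)), NULL))
expr[1, ] <- expr[1, ] + 3   # plant one aberrant sample for illustration

pca <- prcomp(expr, center = TRUE, scale. = TRUE)

k <- 2                                    # assumption: number of PCs to keep
scores <- pca$x[, 1:k, drop = FALSE]

# Scale each PC to unit variance; the squared Euclidean distance to the origin
# is then the squared Mahalanobis distance within the k-PC subspace.
z <- sweep(scores, 2, pca$sdev[1:k], "/")
d2 <- rowSums(z^2)

alpha <- 0.01                             # assumption: expected fraction rejected
threshold <- qchisq(1 - alpha, df = k)    # cutoff on the squared distance

pca_df <- data.frame(scores, d2 = d2, outlier = d2 > threshold)
flagged <- rownames(pca_df)[pca_df$outlier]
print(flagged)
```

With a data frame like this, the `outlier` column can be mapped to an aesthetic in the existing ggplot call, for example by adding the sample names as a column and using `geom_text(aes(label = ifelse(outlier, sample, "")))` to label only the flagged points (here `sample` is a hypothetical column name).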