How to determine if any samples are outliers in PCA? What are the criteria?
11 months ago
Yijing ▴ 10

Hello everyone,

I use the following code to plot PCA:

library(SummarizedExperiment)
library(ggplot2)

# Keep genes whose TPM varies across samples (zero-variance genes cannot be scaled)
abundance <- assays(se)$abundance
tpm <- abundance[apply(abundance, MARGIN = 1, FUN = sd) != 0, ]
logtpm <- log2(tpm + 1)
# prcomp() expects samples in rows; center = TRUE already centres each gene
pca <- prcomp(t(logtpm), center = TRUE, scale. = TRUE)
pca_df <- data.frame(pca$x, colData(se))
ggplot(pca_df, aes(x = PC1, y = PC2, color = TissueArea)) +
    geom_point() +
    labs(x = "PC1", y = "PC2", color = "TissueArea")

But I do not know how to label the outliers, nor what the criteria are for calling a sample an outlier. Do you have any experience or a tutorial to share? Many thanks!


11 months ago

A classical multivariate outlier detection method uses the Mahalanobis distance: points with a large Mahalanobis distance to the sample mean are considered outliers. In this spirit, you can classify as an outlier any point that lies more than a threshold distance t from the centre of its class. The threshold t is chosen so that about p% of the points are expected to be rejected. For approximately multivariate normal data, the squared Mahalanobis distance follows approximately a chi-squared distribution, so t^2 can be read from a chi-squared table as the (1 - alpha) quantile, where the number of degrees of freedom is the number of variables and alpha = p/100 is the expected fraction of outliers. To apply this to your data, note that the Euclidean distance in PCA space, once each PC score is divided by that PC's standard deviation, is equivalent to the Mahalanobis distance in the original space. With many more genes than samples the full covariance matrix is singular, so in practice you compute the distance on the first few PCs and use the number of PCs kept as the degrees of freedom.
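
As a rough illustration of the approach described above, here is a minimal base-R sketch. It is not the poster's data: `logtpm` is a simulated genes-by-samples matrix with one planted outlier, and the choices `k = 2` retained PCs and `alpha = 0.01` are assumptions you would adjust for your own experiment. The names `scores`, `d2`, `threshold` and `pca_flags` are made up for the example.

```r
# Simulated stand-in for a log-TPM matrix (genes x samples); all values are made up
set.seed(1)
n_genes <- 200
n_samples <- 30
logtpm <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_len(n_samples))))
# Plant one obvious outlier by shifting every gene in the last sample
logtpm[, n_samples] <- logtpm[, n_samples] + 3

# PCA with samples in rows, as in the question
pca <- prcomp(t(logtpm), center = TRUE, scale. = TRUE)

k <- 2         # number of PCs kept (assumption; e.g. chosen from a scree plot)
alpha <- 0.01  # expected fraction of flagged samples under the chi-squared model

# Divide each retained PC score by that PC's standard deviation, so that
# squared Euclidean distance on the result is a squared Mahalanobis distance
scores <- sweep(pca$x[, 1:k, drop = FALSE], 2, pca$sdev[1:k], "/")
d2 <- rowSums(scores^2)

# Chi-squared cutoff on the squared distance, with k degrees of freedom
threshold <- qchisq(1 - alpha, df = k)

pca_flags <- data.frame(pca$x[, 1:k, drop = FALSE],
                        d2 = d2,
                        outlier = d2 > threshold)
print(pca_flags[pca_flags$outlier, ])
```

To label the flagged samples on the plot from the question, one option is to add the `outlier` column to `pca_df` and map it to a label or shape, for example with `geom_text()` from ggplot2 restricted to the rows where `outlier` is `TRUE`. Treat the chi-squared cutoff as a screening aid: with few samples the approximation is loose, so flagged samples are worth inspecting before they are dropped.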
