Question

Which of PCA or heatmap plots is better for excluding outlier replicates from normalised microarray or similar datasets?

0

3.3 years ago

Microuser • 0

Hello,

I have a general question about identifying outliers in microarray data. For my normalised datasets, I have generated PCA plots and heatmaps with sample clustering. In the heatmap, the triplicates cluster together. In the PCA plot, however, one replicate can sit much further along PC1 than the other two, for example two replicates at +60 and the third at -20. PC1 explains at least 55% of the variance, and all the replicates have rather similar positions on PC2. My question is: which of the two plots, PCA or heatmap, is more reliable for deciding which outliers to exclude, and why?

Your opinions are much appreciated. Thank you.

microarray outlier pca heatmap • 9.0k views

updated 3.3 years ago by Mensur Dlakic ★ 27k • written 3.3 years ago by Microuser • 0

1

3.3 years ago

Mensur Dlakic ★ 27k

I will preface this by saying that I don't use PCA for the same purpose you do, so my advice may be of limited use to you.

In many machine learning datasets I have handled, the first two PCs are not very reliable for identifying outliers. To digress for a moment, it would have been helpful if you had shown the image rather than describing it in words. Sometimes it is the higher PCs, despite describing only a small fraction of the variance, that capture the outliers better. I don't have the time or the proper knowledge to explain why that is, but I know from experience that it is the case for many diverse datasets. Some of it is explained here and Google will help you find additional info. Yet another option is to try robust PCA algorithms, which are designed to deal with datasets that have corrupted data points. This toolkit may be useful as well.

0

Good to get the Python perspective!

3.3 years ago by Kevin Blighe 87k

score 5 · Accepted Answer · 2020-12-30

Deciding whether a sample is an outlier is usually a judgement call, and it is something that comes with the experience of having worked on dozens, possibly hundreds, of datasets.

The numbers on the PCA axes are unfortunately not a good metric to use on their own.

PCA

Stat ellipse

You could instead generate a stat ellipse at the 95% confidence level, as I do HERE, where an outlier would be any sample falling outside of its respective group's ellipse.
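To make the ellipse idea concrete, here is a minimal sketch in plain Python. It assumes a normal-theory ellipse, which may differ from what a given plotting package draws: under that assumption, a sample lies outside its group's 95% ellipse when its squared Mahalanobis distance on (PC1, PC2) exceeds the chi-square quantile with 2 degrees of freedom, which is -2 * ln(0.05), roughly 5.99. The function names and all the coordinates are made up for illustration.

```python
import math

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each (PC1, PC2) point from the group
    centroid, using the group's own sample covariance (n - 1 denominator)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my) + sxx * (y - my) ** 2) / det
        for x, y in points
    ]

def outside_ellipse(points, level=0.95):
    """True for each point that falls outside the normal-theory ellipse."""
    cutoff = -2.0 * math.log(1.0 - level)  # chi-square quantile, 2 degrees of freedom
    return [d2 > cutoff for d2 in mahalanobis_sq(points)]

# Hypothetical group of 10 samples: nine clustered, one far away (made-up values)
group = [(0, 0), (1, 1), (-1, 1), (1, -1), (-1, -1),
         (2, 0), (-2, 0), (0, 2), (0, -2), (30, 30)]
flags = outside_ellipse(group)
print(flags)

# A triplicate shaped like the one in the question (PC2 values are invented)
triplicate_d2 = mahalanobis_sq([(60, 1), (60, -1), (-20, 0)])
print(triplicate_d2)
```

One caveat worth knowing: when the covariance is estimated from the same n samples being tested, the squared distance can never exceed (n - 1)^2 / n. For a triplicate that ceiling is 4/3, far below 5.99, so no replicate can ever fall outside an ellipse fitted to its own three points. Under these assumptions a group needs at least 8 samples before the test can flag anything.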

Z-scores

You could also generate Z-scores from the PC1 values and define an outlier as any sample with |Z| > 3 or, more stringently, |Z| > 6.
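As a rough sketch of the Z-score approach, assuming the PC1 scores have already been extracted into a plain list (the sample names and values below are invented, and the function names are only illustrative):

```python
import statistics

def z_scores(values):
    """Z-score each value against the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    return [(v - mean) / sd for v in values]

def flag_by_z(names, values, threshold=3.0):
    """Names of samples whose absolute Z-score exceeds the threshold."""
    return [name for name, z in zip(names, z_scores(values)) if abs(z) > threshold]

# Hypothetical PC1 scores for 12 samples (made-up numbers, not from a real dataset)
names = [f"S{i}" for i in range(1, 13)]
pc1 = [-2, -1, 0, 1, 2, -2, -1, 0, 1, 2, 0, 100]
flagged = flag_by_z(names, pc1)
print(flagged)  # ['S12']
```

Note that with the sample standard deviation, |Z| cannot exceed (n - 1) / sqrt(n), so with fewer than 11 samples in total nothing can ever pass |Z| > 3. The Z-scores therefore need to be computed across the whole experiment, not within a single triplicate, where the ceiling is about 1.15.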

-----------------------

Hierarchical clustering

In a dendrogram, an outlier will lie in its own branch, which may extend from the very root of the tree. You can again attempt to quantify this by setting cut-offs based on the distance metric that is used. For example, if a sample branches off into its own leaf / node at a height (Euclidean distance) of 8, then it may be an outlier.

Take a quick look at what I do here: A: extract dendrogram cluster from pheatmap
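The height cut-off idea can be sketched without any clustering library. The following is a small, unoptimised illustration, assuming Euclidean distance and complete linkage (an assumption on my part, so check what your own heatmap function uses). It records the height at which each sample, while still alone in its own branch, is first joined to anything else. The sample names and coordinates are invented.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def singleton_merge_heights(samples):
    """samples: dict of name -> numeric vector.
    Returns name -> dendrogram height at which that sample, still on its own,
    is first merged with another sample or cluster (complete linkage)."""
    def linkage(c1, c2):
        # complete linkage: the largest pairwise distance between two clusters
        return max(euclidean(samples[a], samples[b]) for a in c1 for b in c2)

    clusters = [frozenset([name]) for name in samples]
    heights = {}
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        height = linkage(c1, c2)
        for cluster in (c1, c2):
            if len(cluster) == 1:
                heights[next(iter(cluster))] = height
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return heights

# Hypothetical expression profiles reduced to two values per sample (made up)
samples = {"A1": (0, 0), "A2": (1, 0), "A3": (0, 1), "A4": (1, 1), "X": (9, 9)}
heights = singleton_merge_heights(samples)
candidates = [name for name, h in heights.items() if h >= 8]
print(candidates)  # ['X']
```

The cut-off of 8 is just the example figure from above; a sensible value depends on the scale of your data and on the linkage method.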

-----------------

General

Cook's distance: a measure of how much each observation influences a fitted regression model; it is routinely used in statistics to flag influential points.
+/- 1.5 * IQR: values more than 1.5 * IQR below the first quartile or above the third quartile are flagged. This rule is commonly used in statistics and there is much material online about it.
Bonferroni test on studentised residuals: if you feel up for it, you can try to implement this, but it depends on your input data. I cannot really see it being used in your case: https://www.rdocumentation.org/packages/car/versions/3.0-10/topics/outlierTest
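Of the options listed above, the 1.5 * IQR rule is the quickest to try. A minimal sketch follows, assuming you apply it to one summary value per sample (PC1 scores here, with invented numbers). The `method="inclusive"` setting is one of several quartile conventions, so the fences can differ slightly from those of other tools.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(names, values, k=1.5):
    """Names of samples whose value lies outside the fences."""
    low, high = iqr_fences(values, k)
    return [name for name, v in zip(names, values) if v < low or v > high]

# Hypothetical per-sample values (made-up numbers, not from a real dataset)
names = [f"S{i}" for i in range(1, 13)]
pc1 = [-2, -1, 0, 1, 2, -2, -1, 0, 1, 2, 0, 100]
fences = iqr_fences(pc1)
outliers = iqr_outliers(names, pc1)
print(fences, outliers)
```

Because quartiles are barely affected by a single extreme value, this rule tends to cope better with small sample numbers than Z-scores do.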