I am tackling a multi-class classification problem with 8 classes (tissue types, Class A to H in the plot shown). I ran PCA on my labelled input data using gene expression values across the labelled samples. I calculated the mean of the first four principal components for each tissue type and then computed the Euclidean distance of each labelled sample from its respective class (tissue type) mean. Samples more than 3 standard deviations away from the class mean were flagged as suspected mislabelled samples (outliers). I then removed these outliers and repeated the PCA without them. As expected, the number of outliers decreased with every subsequent round of PCA. The PCA plot I got after removing one set of outliers is attached below:
My questions are as follows:
- How do I know when to stop? How can I use my PCA to tell that my input data is grouped well enough for training an ML model for classification?
- Given that each round of PCA followed by outlier removal is bound to reduce the number of data instances I have, how do I balance the number of data points against how well they group?
- Is there a way to quantitatively determine that I have enough well-grouped samples for training?
Any help is appreciated. Thank you!
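For readers who want to reproduce the procedure described in the question, here is a minimal sketch of a single round. It is not the poster's actual code. The function names (`pca_scores`, `flag_outliers`) and the parameters `n_pcs` and `n_sd` are hypothetical, and "more than 3 standard deviations away from the class mean" is read here as "distance to the class centroid exceeds the class's mean distance plus 3 SD of those distances", which is only one possible interpretation.

```python
import numpy as np

def pca_scores(X, n_pcs):
    """Project samples (rows of X) onto the first n_pcs principal components."""
    Xc = X - X.mean(axis=0)                       # centre each gene (column)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]               # PC scores, one row per sample

def flag_outliers(X, labels, n_pcs=4, n_sd=3.0):
    """Return a boolean mask of suspected mislabelled samples (one round only)."""
    labels = np.asarray(labels)
    scores = pca_scores(np.asarray(X, dtype=float), n_pcs)
    flagged = np.zeros(len(labels), dtype=bool)
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        centroid = scores[idx].mean(axis=0)       # class mean in PC space
        dist = np.linalg.norm(scores[idx] - centroid, axis=1)
        # Assumption: cutoff = mean distance + n_sd * SD of distances in this class
        cutoff = dist.mean() + n_sd * dist.std()
        flagged[idx] = dist > cutoff
    return flagged
```

Calling `flag_outliers(X, labels)` on a samples-by-genes matrix gives the mask for one round; repeating the call on `X[~mask]` corresponds to the iterative removal described above.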
I would tell you to stop with this altogether, or at most after a single iteration. There is no point in trying to get perfect data for training when it is very likely that the model will be used to classify imperfect data. Also, two PCs that together explain barely over 30% of the variance don't strike me as the best criterion for deciding what is an outlier.
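To check how much variance a given number of PCs actually covers, something like the sketch below could be used. It computes the explained variance ratio directly from the singular values of the centred data. The helper names and the 0.8 target in `n_pcs_for` are illustrative assumptions, not a recommended threshold.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance explained by each PC, in decreasing order."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # centre each gene (column)
    S = np.linalg.svd(Xc, compute_uv=False)       # singular values, sorted descending
    var = S ** 2                                  # proportional to per-PC variance
    return var / var.sum()

def n_pcs_for(X, target=0.8):
    """Smallest number of PCs whose cumulative explained variance reaches target.

    The default target of 0.8 is a hypothetical choice for illustration only.
    """
    cum = np.cumsum(explained_variance_ratio(X))
    # searchsorted gives the first index where cum >= target; clamp for rounding error
    return min(int(np.searchsorted(cum, target)) + 1, len(cum))
```

For example, `explained_variance_ratio(X)[:2].sum()` gives the share covered by the first two PCs, which is the roughly 30% figure discussed here.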
Sample A is well-separated. We have no labels here for 'condition' / 'disease'.
<30% explained variance can explain 100% of a subtype of a disease. Let's not throw in the towel.
I think we are talking about different things. What I said:
I see your point of view - thanks for clarifying.
Hi Mensur. Thank you for your input. I think I understand your point about flagging outliers based on the first two PCs, which account for only about 30% of the variance: samples flagged that way might not be outliers at all once further PCs are considered. Is that right? Would it help if I computed the mean for each tissue type across more PCs (say PC1 to PC10) and used those to calculate the distances and standard deviation for flagging outliers?
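One way to explore this would be to run the same flagging rule with different numbers of PCs and compare which samples get flagged. The sketch below is only an illustration: the function name `flags_for_pc_counts` and the default PC counts are hypothetical, and the cutoff (mean distance plus 3 SD of the within-class distances) is an assumed reading of the rule in the question.

```python
import numpy as np

def flags_for_pc_counts(X, labels, pc_counts=(2, 10), n_sd=3.0):
    """Map each PC count k to a boolean mask of samples flagged using PC1..PCk."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Xc = X - X.mean(axis=0)                       # centre each gene (column)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U * S                                # scores on all PCs, computed once
    result = {}
    for k in pc_counts:
        flagged = np.zeros(len(labels), dtype=bool)
        for cls in np.unique(labels):
            idx = np.where(labels == cls)[0]
            sub = scores[idx, :k]                 # this class, first k PCs only
            dist = np.linalg.norm(sub - sub.mean(axis=0), axis=1)
            # Assumption: cutoff = mean distance + n_sd * SD of distances
            flagged[idx] = dist > dist.mean() + n_sd * dist.std()
        result[k] = flagged
    return result
```

Samples flagged at one PC count but not another (for example `result[10] & ~result[2]`) are the ones whose outlier status depends on how many PCs are included.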