Question

Inferring input data suitability (quality and quantity) from PCA for a machine learning classification problem

0

2 days ago

Shakunthala Natarajan ▴ 10

I am tackling a multi-class classification problem with 8 classes (tissue types, Class A to H in the plot shown). I ran PCA on my labelled input data, using gene expression values across the labelled samples. I calculated the mean of the first four principal components for each tissue type and then found the Euclidean distance of each labelled sample from its respective class (tissue type) mean. Samples more than 3 standard deviations away from the class mean were flagged as suspected mislabelled samples (outliers). I then removed these outliers and repeated the PCA without them. As expected, the number of outliers decreased with every subsequent round of PCA. I am attaching below the PCA plot I got after removing one set of outliers:

PCA plot of gene expression across different tissue types
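
For readers who want to reproduce the procedure described above, here is a minimal sketch using NumPy only. It is one possible reading of the description, not the poster's actual code: the function names `pca_scores` and `flag_outliers` are hypothetical, PCA is done via SVD on gene-centred data, and "more than 3 standard deviations away" is assumed to mean that a sample's distance to its class centroid exceeds the class mean distance plus 3 standard deviations of those distances.

```python
import numpy as np

def pca_scores(X, n_components=4):
    """Project samples (rows of X) onto the first principal components via SVD."""
    Xc = X - X.mean(axis=0)  # centre each gene (column)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def flag_outliers(scores, labels, n_sd=3.0):
    """Flag samples that lie unusually far from their own class centroid.

    Assumption: the cutoff is mean(distance) + n_sd * std(distance),
    computed separately within each class.
    """
    labels = np.asarray(labels)
    flagged = np.zeros(len(labels), dtype=bool)
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        centroid = scores[idx].mean(axis=0)
        dist = np.linalg.norm(scores[idx] - centroid, axis=1)
        cutoff = dist.mean() + n_sd * dist.std()
        flagged[idx] = dist > cutoff
    return flagged
```

Note that with this kind of cutoff a class of 10 or fewer samples can never produce a flagged sample, because the largest possible z-score in a group of n values is sqrt(n - 1). Small classes are therefore silently exempt from the filter.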

But my questions are as follows:

How do I know when to stop? How can I use my PCA to tell that the input data I have is grouped well for training an ML Model for classification?
Given that each round of PCA coupled with subsequent outlier removal is bound to reduce the number of data instances I have, how to balance the data points and their grouping?
Is there a way to quantitatively determine that I have enough number of well grouped samples for the training?

Any help is appreciated. Thank you!

outliers datasets pca classification quality • 545 views

ADD COMMENT • link updated 1 day ago by Kevin Blighe 89k • written 2 days ago by Shakunthala Natarajan ▴ 10

3

I would stop doing this altogether, or run at most a single iteration. There is no point in trying to get perfect training data when the model will very likely be used to classify imperfect data. Also, using 2 PCs that together explain barely over 30% of the variance doesn't strike me as the best criterion for deciding what is an outlier.

ADD REPLY • link 2 days ago by Mensur Dlakic ★ 30k

0

Class A is well separated. We have no labels here for 'condition' / 'disease'.

PCs that explain less than 30% of the variance can still account for 100% of a disease subtype. Let's not throw in the towel.

ADD REPLY • link 1 day ago by Kevin Blighe 89k

0

I think we are talking about different things. What I said:

Using 2 PCs that together explain barely over 30% of the variance doesn't strike me as the best criterion for deciding what is an outlier.

ADD REPLY • link 1 day ago by Mensur Dlakic ★ 30k

0

I see your point of view - thanks for clarifying.

ADD REPLY • link 1 day ago by Kevin Blighe 89k

0

Hi Mensur. Thank you for your input. I think I understand your point: flagging outliers based on the first two PCs, which account for only about 30% of the variance, could mean that the flagged samples are not outliers at all once further PCs are considered. Is that right? Would it help if I computed the mean for each tissue type across more PCs (say PC1 to PC10) and used that to calculate the standard deviation and flag outliers?
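
To make the "how many PCs" question concrete, a small sketch follows that reports how much variance each PC explains and how many PCs are needed to reach a chosen cumulative fraction. The helper names and the 80% target are illustrative assumptions, not a recommendation from this thread; the same count could then be used as the number of PCs for the distance calculation.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance explained by each principal component."""
    Xc = X - X.mean(axis=0)                    # centre each gene (column)
    S = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    var = S ** 2
    return var / var.sum()

def n_components_for(X, target=0.8):
    """Smallest number of PCs whose cumulative explained variance reaches target.

    Assumption: `target` is an arbitrary illustrative threshold.
    """
    cum = np.cumsum(explained_variance_ratio(X))
    k = int(np.searchsorted(cum, target)) + 1
    return min(k, len(cum))  # guard against rounding when target is ~1.0
```

If the first two PCs cover only about 30% of the variance, this count will usually be much larger than two, which is one way to see why distances measured in the PC1/PC2 plane alone can be misleading.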

ADD REPLY • link 1 day ago by Shakunthala Natarajan ▴ 10

score 1 · Answer 1 · 2025-11-10

How do I know when to stop? How can I use my PCA to tell that the input data I have is grouped well for training an ML Model for classification?

From my experience and my interpretation of your PCA bi-plot, your data are ready for ML modeling; however, you should note that Class A clusters away from the remaining classes.

Given that each round of PCA coupled with subsequent outlier removal is bound to reduce the number of data instances I have, how to balance the data points and their grouping?

Please clarify what you mean by 'round'. You have not presented any information about outliers, graphical or otherwise, so I cannot comment.

Is there a way to quantitatively determine that I have enough number of well grouped samples for the training?

The PCA bi-plot indicates to me that you are ready for downstream modeling.
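
If a number is wanted in addition to the visual impression, one common option (a substitute technique, not something derived from the plot above) is the silhouette coefficient computed on the PC scores, averaged per class. The sketch below is a plain NumPy version; the function name `mean_silhouette_by_class` is hypothetical, and it assumes at least two classes and Euclidean distance in PC space.

```python
import numpy as np

def mean_silhouette_by_class(scores, labels):
    """Average silhouette coefficient per class, using Euclidean distance.

    Values near 1 suggest a tight, well-separated class; values near 0 or
    below suggest overlap with another class. Assumes >= 2 classes.
    """
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    diff = scores[:, None, :] - scores[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    classes = np.unique(labels)
    sil = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # singleton class: silhouette left at 0
        a = D[i, same].mean()  # mean distance to own class
        # mean distance to the nearest other class
        b = min(D[i, labels == c].mean() for c in classes if c != labels[i])
        sil[i] = (b - a) / max(a, b)
    return {c.item(): float(sil[labels == c].mean()) for c in classes}
```

Tracking these per-class values before and after an outlier-removal round gives a rough way to see whether the round changed the grouping at all; it does not replace evaluating the classifier itself with cross-validation.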

Kevin.