Question

PCA vs sparse PCA : differences in results

6 months ago

Ariadna ▴ 20

Dear colleagues, I have a question about applying two techniques, PCA and sparse PCA, and interpreting their results.

I have a proteomics dataset with 3 groups (healthy, ill, unknown) of 10 subjects each, plus 5 technical replicates (the same plasma from the same pool of individuals). Around 700 proteins were measured using the shotgun method.

I tried to analyse the data using two pipelines:

1. Using the technical replicates, I select the subset of proteins whose CV is below 20% across replicates and apply PCA to that subset. From the first principal components I identify which variables contribute most to those dimensions. I then apply sparse PCA to the same subset of proteins, and the variables with the largest loadings do not intersect with those obtained from PCA.
2. I apply PCA and sparse PCA to the whole set of features. Again, the two sets of results do not intersect with each other, nor with those obtained in step 1.
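
The first pipeline can be sketched roughly as follows. This is only an illustration on synthetic data, not the poster's actual code: the matrices, the helper names (`cv_mask`, `center_log`, `pc1`, `sparse_pc1`, `top_k`, `jaccard`), the log2 transform and the top-20 cutoff are all assumptions. `sparse_pc1` is a simple soft-thresholded power iteration used as a stand-in for a real sparse PCA solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (not the real data): 5 technical replicates and
# 30 subject samples over 700 proteins, all strictly positive.
n_proteins = 700
replicates = rng.lognormal(mean=5.0, sigma=0.1, size=(5, n_proteins))
# Inflate replicate noise for half the proteins so the CV filter removes some.
replicates[:, 350:] *= rng.lognormal(mean=0.0, sigma=0.5, size=(5, 350))
samples = rng.lognormal(mean=5.0, sigma=0.5, size=(30, n_proteins))

def cv_mask(reps, max_cv=0.20):
    """True for proteins whose replicate CV (sd / mean) is below max_cv."""
    return reps.std(axis=0, ddof=1) / reps.mean(axis=0) < max_cv

def center_log(x):
    """Log-transform and column-center the intensity matrix."""
    z = np.log2(x)
    return z - z.mean(axis=0)

def pc1(z):
    """Loadings of the first ordinary principal component."""
    return np.linalg.svd(z, full_matrices=False)[2][0]

def sparse_pc1(z, keep_frac=0.1, n_iter=100):
    """Rank-1 sparse loadings via soft-thresholded power iteration.

    Hypothetical stand-in for a sparse PCA solver; keep_frac sets roughly
    what fraction of the loadings stays non-zero.
    """
    v = pc1(z)
    for _ in range(n_iter):
        u = z @ v
        u /= np.linalg.norm(u)
        w = z.T @ u
        thresh = np.quantile(np.abs(w), 1.0 - keep_frac)
        v = np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)
        v /= np.linalg.norm(v)
    return v

def top_k(loadings, k=20):
    """Indices of the k variables with the largest absolute loadings."""
    return set(np.argsort(np.abs(loadings))[-k:].tolist())

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

mask = cv_mask(replicates)
z = center_log(samples[:, mask])
v_pca, v_sparse = pc1(z), sparse_pc1(z)
overlap = jaccard(top_k(v_pca), top_k(v_sparse))
print(f"{mask.sum()} proteins kept, {np.count_nonzero(v_sparse)} non-zero sparse loadings")
print(f"Jaccard overlap of top-20 sets: {overlap:.2f}")
```

The synthetic samples here contain no group structure, so the overlap value itself means nothing. On real data, repeating the comparison on resampled subjects would be one way to check how stable each top-20 set is before comparing the two methods.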

First of all, since PCA overfits to noise, I am not sure how applicable these methods are, given that the total sample size is only 35 subjects. Second, the dataset is zero-inflated: some proteins were not measured, or their signal/concentration was too low to be measured. PCA does not handle zero vectors, so all such variables are removed.

What would be the most suitable approach for inference in this situation? Should sPCA be used on data that include zero columns? Are there other techniques I could use? The goal is to find the proteins that contribute most to differentiating the unknown group from the other two, so that these hints can later be used for a deeper investigation.
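
To make the zero-column issue concrete: an all-zero column has zero variance, so it cannot be scaled to unit variance, which is why such variables end up dropped before a scaled PCA. A minimal sketch on made-up data (the matrix and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 30 samples x 8 proteins, with zeros where nothing was detected.
x = rng.lognormal(mean=5.0, sigma=0.5, size=(30, 8))
x[:, 2] = 0.0                      # protein never detected
x[rng.random(30) < 0.6, 5] = 0.0   # protein detected in only some samples

zero_frac = (x == 0).mean(axis=0)  # fraction of zeros per protein
sd = x.std(axis=0, ddof=1)         # per-protein standard deviation

constant = sd == 0                 # zero variance: cannot be z-scored
x_kept = x[:, ~constant]

print("zero fraction per protein:", np.round(zero_frac, 2))
print("dropped columns:", np.flatnonzero(constant).tolist())
```

Columns that are only partly zero survive this filter, so the per-protein zero fraction is worth inspecting separately.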

ML • 570 views

updated 6 months ago by Ram 45k • written 6 months ago by Ariadna ▴ 20

score 0 · Answer 1 · 2025-04-28

There are some things you didn't explain here, so I will deal with what's known.

PCA generally doesn't overfit to noise. When applied correctly, keeping only the leading components filters out much of the noise. That said, if the datasets are mostly noise, no method will extract meaningful results from them.
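
A toy illustration of the noise-filtering point, under the assumption of a strong low-rank signal plus Gaussian noise (synthetic data, with dimensions chosen to resemble the question): reconstructing from the leading components brings the matrix closer to the underlying signal than the noisy measurements are.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic example: a rank-2 signal plus unit-variance Gaussian noise.
n_samples, n_features, rank = 30, 700, 2
signal = rng.normal(size=(n_samples, rank)) @ rng.normal(size=(rank, n_features))
noisy = signal + rng.normal(size=(n_samples, n_features))

# PCA via SVD of the column-centered matrix.
mean = noisy.mean(axis=0)
u, s, vt = np.linalg.svd(noisy - mean, full_matrices=False)

# Keep only the leading components and reconstruct.
denoised = (u[:, :rank] * s[:rank]) @ vt[:rank] + mean

err_noisy = np.linalg.norm(noisy - signal)
err_denoised = np.linalg.norm(denoised - signal)
explained = (s[:rank] ** 2).sum() / (s ** 2).sum()
print(f"error before: {err_noisy:.1f}, after: {err_denoised:.1f}")
print(f"variance explained by {rank} components: {explained:.2f}")
```

This only works because the signal is much stronger than the noise here. With a weak signal the leading components mostly capture noise, which is the caveat above.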

I'm not sure what the rationale is for applying sparse PCA to dense data, unless you consider zeros that were actually measured and missing values to be equivalent. It appears that in your experiments zeros don't necessarily mean missing data, so I don't see a justification for sparsifying that matrix.

PCA works best on approximately normally distributed data. Since you mention bias from zero values, I suspect that you didn't normalize the data. I suggest you do so and then apply PCA. Because normalization will break the sparsity structure of the data, I suggest you stay away from sparse PCA. You may want to try truncated singular value decomposition (tSVD) instead, as it doesn't require data normality and works with sparse datasets.
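
A rough sketch of both suggestions on synthetic data. The choice of `log1p` followed by per-protein z-scoring is an assumption about what "normalize" means here, not the only option, and the truncated SVD is done directly with NumPy rather than with any particular library routine.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in: 30 samples x 50 proteins with about 30% zeros.
x = rng.lognormal(mean=5.0, sigma=1.0, size=(30, 50))
x[rng.random(x.shape) < 0.3] = 0.0

# Normalize (assumed recipe): log1p reduces right skew and maps 0 to 0,
# then each protein is z-scored.
logged = np.log1p(x)
sd = logged.std(axis=0, ddof=1)
varying = sd > 0                    # constant columns cannot be z-scored
z = (logged[:, varying] - logged[:, varying].mean(axis=0)) / sd[varying]

k = 2
# PCA scores: SVD of the centered, scaled matrix.
u, s, vt = np.linalg.svd(z, full_matrices=False)
pca_scores = u[:, :k] * s[:k]

# Truncated SVD: same factorization but without centering, so zeros stay zeros.
u2, s2, vt2 = np.linalg.svd(logged, full_matrices=False)
tsvd_scores = u2[:, :k] * s2[:k]

print("zeros before centering:", int((logged == 0).sum()))
print("zeros after z-scoring: ", int((z == 0).sum()))
print("PCA scores:", pca_scores.shape, "tSVD scores:", tsvd_scores.shape)
```

Centering is what removes the zeros: after z-scoring, a former zero becomes a negative value specific to that protein, whereas the uncentered factorization leaves the zero pattern of the input untouched.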