Dear colleagues, I have a question about applying two techniques, PCA and sparse PCA, and interpreting their results.
I have a proteomics dataset with 3 groups (healthy, ill, unknown) of 10 subjects each, plus 5 technical replicates (the same plasma from the same pool of individuals). Around 700 proteins were measured using the shotgun method.
I tried to analyse the data using two pipelines:
1. Using the technical replicates, I select the subset of proteins whose coefficient of variation (CV) across the replicates is below 20%. I apply PCA to this subset and use the first principal components to identify which variables contribute most to those dimensions. I then apply sparse PCA to the same subset of proteins, and the variables with the largest loadings do not overlap with those obtained from PCA.
2. I apply PCA and sparse PCA to the whole set of features. Again, the two sets of results do not overlap with each other, nor with those obtained in step 1.
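For concreteness, pipeline 1 can be sketched roughly as follows. This is a minimal sketch assuming Python with NumPy and scikit-learn, which is not necessarily the software used here. The data are synthetic stand-ins, and the helper names `cv_filter` and `top_features`, the log transform and standardization, the use of the first 2 components, the top-20 cutoff, and the `alpha` value are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA
from sklearn.preprocessing import StandardScaler


def cv_filter(replicates, max_cv=0.20):
    """Boolean mask of proteins whose CV across technical replicates is below max_cv."""
    mean = replicates.mean(axis=0)
    sd = replicates.std(axis=0, ddof=1)
    cv = np.full(mean.shape, np.inf)           # proteins with zero mean never pass
    np.divide(sd, mean, out=cv, where=mean > 0)
    return cv < max_cv


def top_features(components, n_pcs=2, k=20):
    """Indices of the k features with the largest summed |loading| over the first n_pcs."""
    score = np.abs(components[:n_pcs]).sum(axis=0)
    return set(np.argsort(score)[::-1][:k].tolist())


# Synthetic stand-in data (assumption): 30 subjects, 5 replicates, 700 proteins.
rng = np.random.default_rng(0)
n_proteins = 700
subjects = rng.lognormal(mean=2.0, sigma=0.5, size=(30, n_proteins))
replicates = rng.lognormal(mean=2.0, sigma=0.15, size=(5, n_proteins))

keep = cv_filter(replicates)

# Assumed preprocessing: log-transform, then standardize each protein.
X = StandardScaler().fit_transform(np.log2(subjects[:, keep] + 1.0))

pca = PCA(n_components=2).fit(X)
# alpha controls sparsity; 1.0 is an arbitrary illustrative value.
spca = SparsePCA(n_components=2, alpha=1.0, method="cd",
                 max_iter=200, random_state=0).fit(X)

top_pca = top_features(pca.components_)
top_spca = top_features(spca.components_)
overlap = top_pca & top_spca
print(f"{keep.sum()} proteins kept, {len(overlap)} of 20 top features shared")
```

With pure-noise input such as this, little overlap is expected by construction. The returned indices refer to the filtered matrix, so they need mapping back through `keep` before being compared with a run on the full feature set (pipeline 2).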
First of all, since PCA overfits to noise, I am not sure how applicable these methods are given that the total sample size is only 35. Second, the dataset is zero-inflated: some proteins were not measured, or their signal/concentration was too low to be measured. PCA does not handle all-zero columns, so I removed all such variables.
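The zero-handling step can be illustrated with a small sketch, again assuming Python with NumPy. The toy matrix and the helper name `drop_constant_columns` are hypothetical. It reports the fraction of zeros per protein and drops zero-variance columns, which covers the all-zero case.

```python
import numpy as np


def drop_constant_columns(X):
    """Return X without zero-variance columns, plus the boolean keep mask."""
    keep = X.std(axis=0) > 0
    return X[:, keep], keep


# Toy intensity matrix (assumption): rows are samples, columns are proteins.
X_raw = np.array([[0.0, 1.2, 0.0],
                  [0.0, 0.0, 3.1],
                  [0.0, 2.4, 2.9]])

zero_fraction = (X_raw == 0).mean(axis=0)   # share of zeros per protein
X_kept, kept_mask = drop_constant_columns(X_raw)

print(zero_fraction)   # first protein is zero in every sample
print(kept_mask)       # only the all-zero first column is dropped
```

Proteins that are zero in only some samples survive this filter, so the zero inflation itself remains in the data passed to PCA or sparse PCA.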
What would be the most suitable approach for inference in this situation? Should sparse PCA be applied to data that includes the all-zero columns? Are there other techniques I should consider? The goal is to find the proteins that contribute most to differentiating the unknown group from the other two, so that they can serve as leads for a deeper investigation later.