Question

PCA on significant features from univariate analysis in metabolomics - Acceptable or not?

0

4.4 years ago

augihol ▴ 20

I have a very large dataset on a heterogeneous patient population and healthy controls. I performed univariate analysis on metabolites between the two main groups, and then ran a PCA on only the variables that had an unadjusted p-value of less than 0.05.

Without this step I only got marginal separation on the PCA.

With it, I got meaningful separation on both the PCA and hierarchical clustering.

The issue is that I am afraid of making a type I error.

So what do you think, is it OK to perform PCA on significant variables with unadjusted p-values < 0.05 from univariate analysis?

R • 1.1k views

ADD COMMENT • link 4.4 years ago by augihol ▴ 20

1

I am not a statistician, so this is not an expert opinion, just my two cents. DESeq2 uses the most variable regions/genes (whatever you measure; the top 500 by default) in its plotPCA function, as determined by a naive rowVars(). If we accept that the authors are experts on the matter and know what they are doing, I do not see why a more elaborate and reliable feature selection (= your differential analysis) should induce any bias. On the contrary, you select those features that are reliably different between your groups and should therefore be robust markers to separate the samples into groups. Just thinking aloud, please comment if this is nonsense.
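For reference, the variance-based selection described above can be sketched roughly as follows. This is a minimal illustration on simulated data, not DESeq2's actual code: the matrix `mat`, its dimensions, and the cutoff `ntop` are all hypothetical. Note that this kind of selection never looks at the group labels, unlike a p-value filter.

```r
# Simulated stand-in data: 40 samples x 600 metabolites (hypothetical values)
set.seed(1)
n_samples <- 40
n_features <- 600
mat <- matrix(rnorm(n_samples * n_features), nrow = n_samples,
              dimnames = list(paste0("S", seq_len(n_samples)),
                              paste0("M", seq_len(n_features))))

# Rank features by variance; the group labels are not used at any point
ntop <- 100  # hypothetical cutoff (DESeq2's plotPCA defaults to 500)
feature_var <- apply(mat, 2, var)
top_idx <- order(feature_var, decreasing = TRUE)[seq_len(min(ntop, n_features))]

# PCA on the selected features (samples in rows)
pca_var <- prcomp(mat[, top_idx], center = TRUE, scale. = FALSE)
var_explained <- pca_var$sdev^2 / sum(pca_var$sdev^2)
```

Plotting `pca_var$x[, 1:2]` coloured by group would then give the usual PCA scatter.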

ADD REPLY • link 4.4 years ago by ATpoint 81k

0

Hey, thank you for the reply. I think you have a good point. I have tried a similar technique to the one used in DESeq2, where I select highly variable features based on CV, variance, etc. This also gives clustering to some extent, but nothing meaningful.

The issue is that untargeted metabolomics on a patient population will often have a lot of technical variance and meaningless biological variance due to patient age, diet and sex (noise, I guess). The results from PCA on the variables with unadjusted p-values < 0.05 just seem more finely clustered, and give meaningful subclustering within the patients. When we looked at the protein level, using ELISA data from a previous biomarker study, markers were significantly up or down depending on the clusters I extracted from the PCA on significant variables.

The issue is just that most metabolomics studies seem to use all metabolites in their multivariate analyses, so I am a bit unsure at the moment whether this is OK to do.

ADD REPLY • link 4.4 years ago by augihol ▴ 20

1

Your metabolomics dataset is likely already filtered for just the known / identified metabolites; so, the starting size should be ~600 variables - correct? You then further filtered these for those that had un-adjusted p-values < 0.05. As this is metabolomics data, likely nothing passed FDR correction.

I would be more interested in the PCA analysis on the unfiltered dataset, to be honest. I have rarely (or maybe even never) seen PCA used on a dataset filtered for statistically significant variables. However, if you observe meaningful clusters from PCA in this way, then they should also be visible through hierarchical clustering?

As you have implied, metabolomics data is very sensitive to certain parameters, such as diet, time of day of sampling, post-processing, storage, etc.

ADD REPLY • link 4.4 years ago by Kevin Blighe 87k

0

Hey, thank you very much for your reply! You calculated correctly: 600 or so metabolites passed filtering. I did get approximately 50 metabolites that were FDR-significant, and 129 with unadjusted significance.

When I clustered with the FDR-significant metabolites, I got an extremely well clustered PCA; however, I felt this would sort of overfit the data.
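One way to probe this overfitting worry is a null simulation: generate pure noise with no real group difference, apply the same p-value filter, and see how well PC1 separates the groups before and after filtering. The sketch below assumes a Welch t-test as the univariate test (the test actually used is not stated in this thread), and all sizes and names are hypothetical. It says nothing about whether the real clusters are spurious; it only shows how much separation the filter can create by itself.

```r
set.seed(42)
n_per_group <- 20
n_features <- 1000
group <- factor(rep(c("control", "patient"), each = n_per_group))

# Pure noise: no metabolite truly differs between the two groups
noise <- matrix(rnorm(2 * n_per_group * n_features), ncol = n_features)

# Univariate test per metabolite, keeping unadjusted p-values
pvals <- apply(noise, 2, function(x) t.test(x ~ group)$p.value)
selected <- which(pvals < 0.05)

# Separation along PC1, measured as |t| of the PC1 scores between groups
pc1_sep <- function(m) {
  scores <- prcomp(m, center = TRUE)$x[, 1]
  abs(unname(t.test(scores ~ group)$statistic))
}
sep_all <- pc1_sep(noise)                               # all features
sep_selected <- pc1_sep(noise[, selected, drop = FALSE])  # filtered features
```

Comparing `sep_selected` with `sep_all` indicates how much apparent group separation comes from the selection step alone. Permuting the group labels on the real data and repeating the whole filter-then-PCA pipeline would be the analogous check there.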

I work on a disease that is not well categorized metabolically, only clinically. So people tend to treat it as one disease.

It is worth mentioning that I used p-values from the whole patient group compared to healthy controls.

Hierarchical clustering based on the metabolites with significant unadjusted p-values gave 3 well-defined subtypes of the disease, while the healthy controls clustered well in their own group. I made a colour overlay of the subgroups from hierarchical clustering onto the PCA, which showed that each patient subgroup had its own defined place on the PCA.
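For anyone wanting to reproduce this kind of overlay, a minimal sketch on simulated data might look like the following. The group sizes, the Ward linkage, Euclidean distance, and `k = 4` (three patient subgroups plus controls) are assumptions for illustration, not necessarily the settings used here.

```r
set.seed(7)
# Hypothetical data: 4 groups of 15 samples, 50 features;
# each group is shifted in its own block of 10 features
grp <- rep(c("healthy", "sub1", "sub2", "sub3"), each = 15)
dat <- matrix(rnorm(60 * 50), nrow = 60)
dat[grp == "sub1", 1:10]  <- dat[grp == "sub1", 1:10] + 3
dat[grp == "sub2", 11:20] <- dat[grp == "sub2", 11:20] + 3
dat[grp == "sub3", 21:30] <- dat[grp == "sub3", 21:30] + 3

scaled <- scale(dat)
pca <- prcomp(scaled)

# Hierarchical clustering on the same scaled matrix, cut into 4 clusters
hc <- hclust(dist(scaled), method = "ward.D2")
clusters <- cutree(hc, k = 4)

# Cross-tabulate cluster membership against the simulated groups
agreement <- table(cluster = clusters, group = grp)

# Overlay: colour the PCA scores by cluster (only when run interactively)
if (interactive()) {
  plot(pca$x[, 1:2], col = clusters, pch = 19)
}
```

The `agreement` table is a quick way to see whether the clusters line up with known labels before reading too much into the plot.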

I then did a univariate analysis on each cluster (patient subgroup) vs. healthy controls, and a lot of interesting biological patterns appeared that suggest two metabolic phenotypes.

I speculate that the previous approach of comparing all patients together against healthy controls made the differences within the patient group cancel each other out.

It's tricky to work with, but somehow it just makes sense with the current approach.

I appreciate any critique or suggestions for how I might improve my approach :)

ADD REPLY • link 4.4 years ago by augihol ▴ 20

1

Okay, it seems that you are doing okay, in that case. I have published a few papers in the metabolomics field and I know that there are no standards. You did not mention which test you are using, but, in our case, we used binary logistic regression models of the form:

glm(DiseaseStatus ~ metabolite, family = binomial)

...and extracted a p-value from that. In some cases, we adjusted for covariates.

The input data was Z-scaled, log2-transformed metabolite levels, which had previously undergone rigorous QC to eliminate highly variable metabolites in the controls.
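As a rough sketch of that workflow on simulated data (the QC step is omitted, and `raw`, `status`, and the metabolite names are hypothetical placeholders, not our actual pipeline):

```r
set.seed(123)
n <- 80
# Hypothetical raw intensities (log-normal) for 5 metabolites
raw <- matrix(rlnorm(n * 5, meanlog = 10, sdlog = 1), nrow = n,
              dimnames = list(NULL, paste0("met", 1:5)))
status <- rbinom(n, 1, 0.5)  # 0 = control, 1 = disease (simulated labels)

# log2-transform, then Z-scale each metabolite (column)
z <- scale(log2(raw))

# One binary logistic regression per metabolite;
# keep the Wald p-value of the metabolite coefficient
pvals <- apply(z, 2, function(m) {
  fit <- glm(status ~ m, family = binomial)
  coef(summary(fit))["m", "Pr(>|z|)"]
})

# Benjamini-Hochberg adjustment across metabolites
padj <- p.adjust(pvals, method = "BH")
```

Covariates such as age or sex could be added to the right-hand side of the formula, e.g. `status ~ m + age`, if those variables are available.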

ADD REPLY • link 4.4 years ago by Kevin Blighe 87k

score 2 · Accepted Answer · 2019-11-26

It is worth trying non-linear dimensionality reduction techniques (on the full data) such as t-SNE or UMAP, as they have been known to work better than PCA in some cases. t-SNE will be very slow if you have a large number of samples, say > 10,000, so you may want to first run PCA, keep the top 50 principal components, and then run t-SNE on that reduced dataset.
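A minimal sketch of the PCA-then-t-SNE idea, assuming the Rtsne package is installed. The data are simulated and the sizes and perplexity are hypothetical.

```r
set.seed(2019)
# Hypothetical data: 300 samples x 600 metabolites
x <- matrix(rnorm(300 * 600), nrow = 300)

# Step 1: PCA, keeping only the first 50 principal components
n_pcs <- 50
pcs <- prcomp(x, center = TRUE, scale. = TRUE, rank. = n_pcs)$x

# Step 2: t-SNE on the reduced matrix (needs the Rtsne package)
if (requireNamespace("Rtsne", quietly = TRUE)) {
  # pca = FALSE because the PCA step has already been done above
  tsne <- Rtsne::Rtsne(pcs, dims = 2, perplexity = 30, pca = FALSE)
  embedding <- tsne$Y  # one row per sample, two columns
}
```

Rtsne can also do the PCA step internally via its `pca` and `initial_dims` arguments; doing it explicitly just makes the two stages visible.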

See here for links and images.