Hello, I would appreciate comments and advice on when to use Principal Component Analysis (PCA) and on what its output represents. My understanding of the algorithm is that a set of correlated variables is re-expressed as a set of uncorrelated variables (the principal components), from which one can derive an understanding of the variation in the data. However,

Once you have represented your data as a set of principal components,

**is there some way to determine which features are actually represented in each principal component?** In other words, what does each principal component (PC) actually represent? If I understand correctly, the first two PCs capture the largest share of the variance in the dataset, but how can I tie that back to the actual variables/features that I measured in the first place? I wish to determine which features, from a set of features (e.g., hydrophobicity, amino acid composition), are the "best" for predicting whether a protein sequence will adopt a specific fold that I have in mind.
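To make the question concrete, here is a minimal sketch of what I mean by "tying components back to features". It is written in Python/NumPy purely for illustration; as far as I know, R's `prcomp` exposes the same quantities as `rotation` and `sdev`. The data are synthetic, and the feature names (`hydrophobicity`, `helix_propensity`, `charge`) and the helper `pca_loadings` are hypothetical names I made up for this example, not real measurements or a library function.

```python
import numpy as np

def pca_loadings(X):
    """Return (loadings, explained_variance_ratio) for a samples-by-features matrix.

    Column j of `loadings` holds the weight of each original feature in PC j.
    """
    # Standardize so that features on different scales contribute comparably.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data: rows of Vt are the principal axes,
    # ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Vt.T, explained

# Synthetic toy data (assumption: three made-up features, two of them correlated).
rng = np.random.default_rng(0)
n = 200
hydrophobicity = rng.normal(size=n)
helix_propensity = hydrophobicity + 0.1 * rng.normal(size=n)  # tracks hydrophobicity
charge = rng.normal(size=n)                                   # independent of the others
X = np.column_stack([hydrophobicity, helix_propensity, charge])
feature_names = ["hydrophobicity", "helix_propensity", "charge"]

loadings, explained = pca_loadings(X)

# For each component, list the features by the size of their loading.
for j in range(loadings.shape[1]):
    order = np.argsort(-np.abs(loadings[:, j]))
    ranked = ", ".join(f"{feature_names[i]} ({loadings[i, j]:+.2f})" for i in order)
    print(f"PC{j + 1}: {explained[j]:.1%} of variance; loadings: {ranked}")
```

In this toy case I would expect the two correlated features to dominate the first component. What I am unsure about is whether a large loading says anything about how well a feature predicts the fold, since PCA never sees the fold labels.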

**If PCA does not do that, what is the best technique to use? I have heard of "feature selection" but I am not very familiar with it. If anyone can elaborate on whether and how it differs from PCA, that would be much appreciated. Are there known examples (e.g., articles, reviews) in protein structure prediction that address this?**

I intend to use R for this analysis, so any suggestions for R libraries that will do the job are most welcome! Thank you very much for your advice and responses!

-Deena

Thank you all very much for your fantastic and detailed responses!