Hello everyone, I was looking at some PCA plots from an experiment. The data don't cluster on their first two principal components, but if I dig a little deeper into the analysis and look at other combinations, I can see clear clustering when comparing PC5 (3% of variance) vs PC6 (2% of variance).

What does that mean? We would have expected a big separation in the first 2 main principal components. Does it mean that the expected separation is minimal and is explained by the (3+2)%=5% of variance of the whole experiment?

Thanks in advance.

Could be this ^^, and, yes, generally, there is / are some other source(s) of variation in your data that is / are much greater than the expected source.

You can check for statistically significant correlations between your PCs and your metadata via: Correlate the principal components back to the clinical data
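To make the idea concrete, here is a minimal sketch in Python/numpy (not the linked workflow itself, and not your data). It simulates a matrix in which a hypothetical unrecorded factor, called `batch` here, is much stronger than the expected `group` effect, then correlates the first few PC scores with each metadata column. All names, effect sizes, and the permutation test are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data (NOT the poster's data): 40 samples x 200 genes.
n_samples, n_genes = 40, 200
group = np.arange(n_samples) % 2          # expected biological grouping
batch = (np.arange(n_samples) // 2) % 2   # hypothetical unrecorded nuisance factor

expr = rng.normal(size=(n_samples, n_genes))
expr[:, :100] += 3.0 * batch[:, None]     # large effect spread over many genes
expr[:, 100:120] += 2.0 * group[:, None]  # modest effect on a few genes

# PCA via SVD of the gene-centred matrix.
centred = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
scores = u * s                       # sample coordinates on each PC
explained = s**2 / np.sum(s**2)      # fraction of variance per PC


def pc_metadata_correlation(scores, covariate, n_pcs=6):
    """Pearson r between each of the first n_pcs score vectors and a covariate."""
    return np.array(
        [np.corrcoef(scores[:, k], covariate)[0, 1] for k in range(n_pcs)]
    )


def permutation_pvalue(x, y, n_perm=2000, rng=rng):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = sum(
        abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


for name, covariate in (("batch", batch), ("group", group)):
    r = pc_metadata_correlation(scores, covariate)
    best = int(np.argmax(np.abs(r)))
    p = permutation_pvalue(scores[:, best], covariate)
    print(f"{name}: strongest on PC{best + 1} "
          f"({explained[best]:.1%} of variance), r={r[best]:.2f}, p={p:.4f}")
```

In this toy setup the nuisance factor takes over PC1 and the expected grouping only shows up on a later, low-variance PC, which is the same pattern as the PC5/PC6 observation. With real data you would replace `group` and `batch` with every metadata column you can get hold of.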

Without actually seeing your data, there's not much else that we can say.

Thanks both for the useful hints. Not sure I have ever heard about correlation amongst principal components, sorry! But once you have detected other possible sources of variation, how can you deal with them? Is it always about checking for confounding via SVA or similar tools (by the way, SVA returned something like 18-19 surrogate variables)? More importantly, once you have corrected your data, what tells you that the results you're seeing are due to biology and not to other artefacts (e.g. overcorrection)?

I would contact the sequencing (?) team and request from them as much metadata about the samples as possible. If SVA detects that many surrogate variables, then that is somewhat worrying about the state of the data, but it is consistent with the PCA result that you have encountered, i.e., that there are 1 or more sources of variation in the data that are larger than the known ones.

Once you retrieve metadata from the sequencing team, you can start to investigate which variables, if any, are responsible for the variation, and then proceed from that point.

If you actually identify 1 or 2 variables that are responsible for the unknown variation, then we go back to the `removeBatchEffect()` thread that you created last week (do you remember?).

Thanks a lot Kevin for this useful information. I think it would be really important to get in touch with the sequencing facility in order to understand what is going on with my samples.

fingers crossed!

Sorry for resurrecting this old post. I am still referring to the same dataset, where a nice clustering between groups is mainly achieved when PC5 is plotted against PC6. This is really a curiosity-driven question, but would it be possible somehow to extract and determine which genes are involved in the segregation of these samples on PC5 and PC6?

Yes, PCAtools can do this: Determine the variables that drive variation among each PC

But, let me know how you have currently done your PCA?
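For anyone who wants to see what "the variables that drive a PC" means numerically, below is a rough Python/numpy sketch of the underlying idea (it is not PCAtools code). Each PC is a weighted combination of genes, and the genes with the largest absolute weights (loadings) contribute most to sample positions on that PC. The helper `top_loading_genes`, the gene names, and the simulated matrix are all hypothetical, made up for the example.

```python
import numpy as np


def top_loading_genes(expr, gene_names, pc, n_top=10):
    """Return (gene, loading) pairs with the largest absolute loading on a PC.

    expr is a samples x genes matrix; pc is 1-based (pc=5 means PC5).
    """
    centred = expr - expr.mean(axis=0)
    # Rows of vt are the PC axes; each entry is one gene's loading.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    loadings = vt[pc - 1]
    order = np.argsort(np.abs(loadings))[::-1][:n_top]
    return [(gene_names[i], float(loadings[i])) for i in order]


# Simulated stand-in data (NOT the poster's data): 40 samples x 200 genes.
rng = np.random.default_rng(1)
n_samples, n_genes = 40, 200
gene_names = [f"gene_{i}" for i in range(n_genes)]
group = np.arange(n_samples) % 2          # grouping of interest
batch = (np.arange(n_samples) // 2) % 2   # hypothetical dominant nuisance factor

expr = rng.normal(size=(n_samples, n_genes))
expr[:, :100] += 3.0 * batch[:, None]     # nuisance effect, dominates PC1
expr[:, 100:120] += 2.0 * group[:, None]  # group effect, lands on a later PC

# In this toy example the group separation sits on PC2, so inspect that PC.
for gene, loading in top_loading_genes(expr, gene_names, pc=2, n_top=10):
    print(f"{gene}\t{loading:+.3f}")
```

For the dataset in this thread you would call it with `pc=5` and `pc=6`. Note that the sign of a loading is arbitrary up to flipping the whole PC, so rank by absolute value.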