Question

PCA plots and biological information taken from other principal components

3.9 years ago

Mozart ▴ 330

Hello everyone, I was looking at some PCA plots from an experiment. The samples don't cluster in the first two principal components, but if I dig a little deeper into the analysis and look at other combinations, I can see clear clustering when plotting PC5 (3% of variance) against PC6 (2% of variance).

What does that mean? We expected a clear separation in the first two principal components. Does it mean that the expected separation is small and accounts for only (3+2)% = 5% of the total variance in the experiment?

Thanks in advance.

pca plot RNA-Seq • 1.2k views

score 3 · Accepted Answer · 2020-05-09

3.9 years ago

WouterDeCoster 47k

Could it be that there are technical confounders leading to more variability than the biological differences?

ADD COMMENT • link 3.9 years ago by WouterDeCoster 47k

Could be this ^^, and, yes, generally, there are one or more other sources of variation in your data that are much greater than the expected source.

You can check for statistically significant correlations between your PCs and your metadata via: Correlate the principal components back to the clinical data
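To make the idea concrete, here is a minimal NumPy sketch of correlating PC scores with a metadata covariate. It is not the PCAtools implementation (the linked section appears to refer to its `eigencorplot()` function); the toy data, the `batch` and `group` covariates, and the helper `pc_correlations` are all hypothetical and only there for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 samples x 200 genes (rows = samples).
n_samples, n_genes = 20, 200
batch = np.repeat([0.0, 1.0], n_samples // 2)   # technical covariate (assumed known)
group = np.tile([0.0, 1.0], n_samples // 2)     # biological condition of interest

expr = rng.normal(size=(n_samples, n_genes))
expr[:, :100] += 3.0 * batch[:, None]           # strong batch effect on 100 genes
expr[:, 100:160] += 1.5 * group[:, None]        # weaker biological effect on 60 genes

# PCA via SVD of the gene-centred matrix.
centred = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
scores = u * s                                  # sample coordinates on each PC
explained = s**2 / np.sum(s**2)                 # fraction of total variance per PC

def pc_correlations(scores, covariate, n_pcs=6):
    """Pearson r between each of the first n_pcs PCs and one numeric covariate."""
    return np.array(
        [np.corrcoef(scores[:, k], covariate)[0, 1] for k in range(n_pcs)]
    )

r_batch = pc_correlations(scores, batch)
r_group = pc_correlations(scores, group)
for k in range(6):
    print(
        f"PC{k + 1}: {explained[k]:.1%} of variance, "
        f"r(batch)={r_batch[k]:+.2f}, r(group)={r_group[k]:+.2f}"
    )
```

In this toy setup the leading PC tracks the technical covariate rather than the condition of interest, which is the pattern described above. For categorical metadata with more than two levels, an ANOVA-style test per PC would be a more natural choice than Pearson correlation.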

Without actually seeing your data, there's not much else that we can say.

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

Thanks both for the useful hints. Not sure I have ever heard about correlation amongst principal components, sorry! But once you have detected other possible sources of variation, how can you deal with them? Is it always about checking for confounding via SVA or similar tools (btw, SVA returned something like 18-19 surrogate variables)? More importantly, once you have corrected your data, what tells you that the results you're seeing are due to biology and not to other artefacts (i.e. overcorrection)?

ADD REPLY • link 3.9 years ago by Mozart ▴ 330

I would contact the sequencing (?) team and request as much metadata about the samples as possible. If SVA detects that many surrogate variables, then that is somewhat worrying about the state of the data, but it corroborates the PCA result that you encountered, i.e., that there are one or more sources of variation in the data that are larger than the known ones.

Once you retrieve metadata from the sequencing team, you can then start to investigate which variables, if any, are responsible for the variation, and proceed from that point.

If you actually identify one or two variables that are responsible for the unknown variation, then we go back to the removeBatchEffect() thread that you created last week (do you remember?)
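As a rough illustration of what removing a known batch variable amounts to, the sketch below fits a per-gene linear model with intercept, group, and batch terms and subtracts only the fitted batch term. This is a simplified NumPy analogue written for this thread, not limma's actual code; the function name `remove_known_batch` and the simulated data are hypothetical.

```python
import numpy as np

def remove_known_batch(expr, batch, group):
    """Subtract the fitted batch term from expr (samples x genes), keeping the group term.

    Simplified least-squares analogue of batch removal; not limma's implementation.
    """
    batch_c = batch - batch.mean()              # centre so the overall mean is untouched
    design = np.column_stack([np.ones_like(batch_c), group, batch_c])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)  # shape: (3, n_genes)
    return expr - np.outer(batch_c, coef[2])    # remove only the batch component

rng = np.random.default_rng(1)
batch = np.repeat([0.0, 1.0], 10)               # hypothetical two-level batch
group = np.tile([0.0, 1.0], 10)                 # condition, balanced across batches
expr = rng.normal(size=(20, 50)) + 3.0 * batch[:, None]

corrected = remove_known_batch(expr, batch, group)
gap_before = expr[batch == 1].mean() - expr[batch == 0].mean()
gap_after = corrected[batch == 1].mean() - corrected[batch == 0].mean()
print(f"batch gap before: {gap_before:.2f}, after: {gap_after:.2f}")
```

A corrected matrix like this is mainly useful for visualisation (PCA, clustering, heatmaps). For differential expression it is generally preferable to include the batch variable in the model design rather than to test on corrected values.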

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

Thanks a lot, Kevin, for this useful information. I think it is really important to get in touch with the sequencing facility in order to understand what is going on with my samples.

Fingers crossed!

ADD REPLY • link 3.9 years ago by Mozart ▴ 330

Sorry for resurrecting this old post. I am still referring to the same dataset, where a nice clustering between groups is mainly achieved when PC5 is plotted against PC6. This is really a curiosity-driven question, but would it be possible to extract and determine which genes are involved in the segregation of these samples on PC5 and PC6?

ADD REPLY • link 3.5 years ago by Mozart ▴ 330

Yes, PCAtools can do this: Determine the variables that drive variation among each PC
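For intuition, the genes that drive a given PC are the ones with the largest absolute loadings on it. The sketch below shows this with plain NumPy on simulated data; it is not the PCAtools code path (the linked section appears to refer to its `plotloadings()` function), and `top_loading_genes`, the gene names, and the data are hypothetical.

```python
import numpy as np

def top_loading_genes(expr, gene_names, pc, n_top=5):
    """Return the n_top genes with the largest absolute loading on a 1-based PC index."""
    centred = expr - expr.mean(axis=0)          # centre each gene (column)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    loadings = vt[pc - 1]                       # one weight per gene for this PC
    order = np.argsort(-np.abs(loadings))[:n_top]
    return [(gene_names[i], float(loadings[i])) for i in order]

rng = np.random.default_rng(2)
n_samples, n_genes = 24, 100
gene_names = [f"gene{i}" for i in range(n_genes)]   # hypothetical identifiers
group = np.tile([0.0, 1.0], n_samples // 2)
expr = rng.normal(size=(n_samples, n_genes))
expr[:, :5] += 6.0 * group[:, None]             # five genes carry the group signal

top = top_loading_genes(expr, gene_names, pc=1)
for name, weight in top:
    print(f"{name}: {weight:+.3f}")
```

In the toy data the signal sits on PC1, so `pc=1` is used; for the dataset discussed here the same call would be made with `pc=5` and `pc=6`. The sign of a loading is arbitrary up to a flip of the whole component, so the ranking by absolute value is what matters.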

But let me know: how have you currently done your PCA?

ADD REPLY • link 3.5 years ago by Kevin Blighe 87k