Question: Principal component analysis of SVA's surrogate variables
5 months ago, RNAseqer wrote:

I was wondering if someone could explain a property of the surrogate variable values I get from Bioconductor's sva package, which I'm very new to. Specifically, I'm performing a PCA on the surrogate variable matrix from svobj, and the principal components I'm getting seem weirdly equal.

I performed sva on a dataset and it estimated that there were 4 surrogate variables.

I downloaded the surrogate variable (sv) matrix from svobj:

           sv1          sv2          sv3          sv4
1  0.170511776 -0.026039142  0.155162179 -0.052378086
2 -0.031146292  0.231081859 -0.119285616  0.020441932
3  0.304738317 -0.114059097  0.056133569 -0.008361104
4  0.384487981  0.222407059 -0.001225998 -0.003543087
5 -0.100784593 -0.076275696  0.013916598 -0.087402628
6 -0.091898903 -0.159580076  0.210261199 -0.042860031
7  0.006998733  0.021321322 -0.018007686 -0.009117072
8 -0.042037192  0.161543154  0.111127593 -0.207275659
9  0.113874692 -0.064348147 -0.102071872  -0.14602898
...

I ran principal component analysis on it (samples as rows, surrogate variables as columns).

sv.pca <- prcomp(sv.mat, scale. = TRUE)

And taking a look at my output:

> sv.pca
Standard deviations (1, .., p=4):
[1] 1 1 1 1

Rotation (n x k) = (4 x 4):
            PC1         PC2        PC3       PC4
sv1  0.18920596  0.65779848 -0.6119765 0.3962158
sv2  0.08862912 -0.73074539 -0.4087601 0.5395102
sv3  0.67067428  0.08332028  0.5861903 0.4468049
sv4 -0.71171764  0.16238861  0.3387932 0.5935546

A scree plot of these principal components shows that each one accounts for exactly 25% of the variation. The standard deviations in sv.pca are all 1, and the PCA plots all show a similar buckshot pattern with equal scales on every axis.

So why is this? Is it to be expected given that what I'm looking at are surrogate variables? Or is it a product of the way sva scales its data and divides up the work of accounting for variance?

I know that it's not a smart idea to start thinking "what should happen" and "what should be the case" when doing statistical analysis. But that exact 25% for each principal component has me thinking I must either be doing something very wrong or have a major hole in my understanding of what I'm working with here.

Here is what I'm thinking; perhaps you can correct my misunderstanding of what is happening in this analysis, if that's what this is. If surrogate variables are an adjustment for unknown variables skewing the data in some direction or other, it seems more likely than not that one set of unseen variables (those contributing to surrogate variable X) would be more potent, even slightly, than others (those contributing to surrogate variable Y).

For example, imagine a study comparing the expression data of patients with chronic anxiety to controls. Suppose age, gender, and BMI all happen to be the unknown variables accounted for by surrogate variable X when running sva, while dental hygiene, coffee consumption, and literary preference wind up being accounted for by surrogate variable Y. Then I'd think that, while SV X and SV Y together would account for all of the expression heterogeneity in the study, the values recorded for SV X would differ significantly from those for SV Y, and running a PCA on them would probably give different eigenvalues for the principal components. I'm not talking about any necessarily extreme difference or strong pattern; I'd just expect there to be SOME difference. I find it weird that my data would be perfectly balanced among all four PCs when I look at the variance between these surrogate variable values. Or have I missed something fundamental here?

I'd be very grateful if someone could shed light on this for me.

surrogate variables stdev sva • 266 views
modified 5 months ago by geek_y • written 5 months ago by RNAseqer

What is your goal in running PCA on surrogate variables?

written 5 months ago by geek_y

To identify the principal components responsible for their spread. They are like any other collection of data points, aren't they? Is there any reason you can't perform dimension reduction on them?

And more than anything I'm really just curious as to why this breakdown of four equal principal components happened. The fact that I have no idea why that might be says I've got a large gap in my understanding and I'm trying to fill it.
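One possible reading of the equal components can be checked with a small simulation. The sketch below is not sva itself: it assumes the surrogate variables behave like right singular vectors of a row-centered residual matrix, which would make them mutually orthogonal, unit-length, and mean-zero. The names resid, sv.mock, and pca.mock and the matrix sizes are made up for illustration.

```r
set.seed(1)
n <- 20  # hypothetical number of samples
k <- 4   # number of surrogate variables, as in the question

# Simulated stand-in for a residual expression matrix (genes x samples),
# centered so that each gene has mean zero across samples.
resid <- matrix(rnorm(500 * n), nrow = 500)
resid <- resid - rowMeans(resid)

# Take the first k right singular vectors as mock surrogate variables.
sv.mock <- svd(resid)$v[, 1:k]
colnames(sv.mock) <- paste0("sv", 1:k)

# The columns are orthonormal and, because of the centering, sum to zero,
# so their correlation matrix is (numerically) the identity.
round(crossprod(sv.mock), 10)
round(cor(sv.mock), 10)

# PCA on such a matrix has nothing to separate: every component
# ends up with the same standard deviation.
pca.mock <- prcomp(sv.mock, scale. = TRUE)
pca.mock$sdev
```

If the real sv.mat gives an identity matrix from cor(sv.mat) as well, then four equal components of 25% each would be what PCA has to return, whatever the biological strength of each hidden factor.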

modified 5 months ago • written 5 months ago by RNAseqer
5 months ago, geek_y (Barcelona) wrote:

You can compute the correlation (cor()) between the PCs of your expression data and the surrogate variables identified by sva. That shows which PCs (and so what share of the variation across samples) are captured by the hidden covariates, or vice versa. That is different from computing PCs ON the surrogate variables, which is not meaningful, to my knowledge.
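A minimal sketch of that comparison, using simulated data so that it runs without sva installed. Everything here is a stand-in: expr is a made-up genes-by-samples matrix, hidden is an invented batch-like factor, and sv.mat imitates the matrix you would take from svobj$sv in a real analysis.

```r
set.seed(2)
n.samples <- 20   # hypothetical
n.genes   <- 300  # hypothetical

# Hypothetical hidden factor (e.g. a batch effect) that shifts the first 100 genes.
hidden <- rnorm(n.samples)
expr <- matrix(rnorm(n.genes * n.samples), nrow = n.genes)
expr[1:100, ] <- expr[1:100, ] + outer(rep(2, 100), hidden)

# PCA on the expression data itself (samples as rows),
# not on the surrogate variables.
expr.pca <- prcomp(t(expr))

# Stand-in for svobj$sv: one column tracking the hidden factor,
# one unrelated column.
sv.mat <- cbind(sv1 = hidden + rnorm(n.samples, sd = 0.1),
                sv2 = rnorm(n.samples))

# Rows are expression PCs, columns are surrogate variables.
pc.sv.cor <- cor(expr.pca$x[, 1:4], sv.mat)
round(pc.sv.cor, 2)
```

Large absolute values in the resulting table indicate which expression PCs line up with which surrogate variable. The sign of a correlation is arbitrary, because the sign of a principal component is.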

modified 5 months ago • written 5 months ago by geek_y