I was wondering if someone could explain a property of the surrogate variable values I get from Bioconductor's sva package. I'm very new to the package. Specifically, I'm performing a PCA on the surrogate variable matrix from svobj, and the principal components I'm getting seem weirdly equal.
I performed sva on a dataset and it estimated that there were 4 surrogate variables.
I downloaded the surrogate variable (sv) matrix from svobj:
               sv1          sv2          sv3          sv4
    1  0.170511776 -0.026039142  0.155162179 -0.052378086
    2 -0.031146292  0.231081859 -0.119285616  0.020441932
    3  0.304738317 -0.114059097  0.056133569 -0.008361104
    4  0.384487981  0.222407059 -0.001225998 -0.003543087
    5 -0.100784593 -0.076275696  0.013916598 -0.087402628
    6 -0.091898903 -0.159580076  0.210261199 -0.042860031
    7  0.006998733  0.021321322 -0.018007686 -0.009117072
    8 -0.042037192  0.161543154  0.111127593 -0.207275659
    9  0.113874692 -0.064348147 -0.102071872 -0.14602898
    ...
I ran principal component analysis on it (samples as rows, surrogate variables as columns).
And taking a look at my output:
    > sv.pca
    Standard deviations (1, .., p=4):
    1 1 1 1

    Rotation (n x k) = (4 x 4):
                PC1         PC2        PC3       PC4
    sv1  0.18920596  0.65779848 -0.6119765 0.3962158
    sv2  0.08862912 -0.73074539 -0.4087601 0.5395102
    sv3  0.67067428  0.08332028  0.5861903 0.4468049
    sv4 -0.71171764  0.16238861  0.3387932 0.5935546
A scree plot of these principal components shows that each one accounts for exactly 25% of the variation. The standard deviations in sv.pca are all 1, and the PCA plots all show a similar buckshot pattern on equal scales.
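For anyone who wants to reproduce the pattern without my data, here is a minimal sketch of one thing that might be worth checking. It assumes, without confirmation from the sva documentation, that the columns of the sv matrix are mean-centered and mutually orthonormal. The matrix `sv_like` below is a simulated stand-in, not real sva output, and `n`, `pca_scaled` and `pca_raw` are hypothetical names. It also assumes the PCA was run with `prcomp(..., scale. = TRUE)`, which would be consistent with standard deviations of exactly 1.

```r
# Simulated stand-in for svobj$sv. This is NOT sva output: it is a random
# matrix forced to have centered, orthonormal columns (the assumed property).
set.seed(1)
n <- 20  # hypothetical number of samples
k <- 4   # number of surrogate variables, as in the post

x <- matrix(rnorm(n * k), nrow = n, ncol = k)
x <- scale(x, center = TRUE, scale = FALSE)  # center each column
sv_like <- qr.Q(qr(x))                       # orthonormal basis of the centered columns
colnames(sv_like) <- paste0("sv", 1:k)

# Diagnostics: column means should be ~0 and the cross-product ~identity
print(round(colMeans(sv_like), 10))
print(round(crossprod(sv_like), 10))

# PCA with samples as rows and variables as columns
pca_scaled <- prcomp(sv_like, scale. = TRUE)
pca_raw    <- prcomp(sv_like)

print(pca_scaled$sdev)  # all numerically equal to 1
print(pca_raw$sdev)     # all numerically equal to 1 / sqrt(n - 1)
```

With centered, orthonormal columns the sample covariance matrix is the identity divided by n - 1, so every direction carries the same variance and each of the four components gets 25%. Running `colMeans(svobj$sv)` and `crossprod(svobj$sv)` on the real matrix would show whether the same property holds there.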
So why is this? Is this to be expected given that what I'm looking at are surrogate variables? Or is it a product of the way sva scales its data and divides up the work of accounting for variance?
I know that it's not a smart idea to start thinking "what should happen" and "what should be the case" when doing statistical analysis. But seeing exactly 25% for each principal component has me thinking I must either be doing something very wrong or have a major hole in my understanding of what I'm working with here.
Here is what I'm thinking; perhaps you can correct my misunderstanding, if that's what this is. If surrogate variables are an adjustment for unknown variables skewing the data in one direction or another, it seems more likely than not that one set of unseen variables (those contributing to surrogate variable X) would be more potent, even slightly, than another (those contributing to surrogate variable Y).
For example, imagine a study comparing the expression data of patients with chronic anxiety to controls. Suppose age, gender, and BMI all happen to be the unknown variables accounted for by surrogate variable X when running sva, while dental hygiene, coffee consumption, and literary preference wind up being accounted for by surrogate variable Y. Together, SV X and SV Y would account for all of the expression heterogeneity in the study, but I'd expect the values recorded for SV X to differ noticeably from those for SV Y, so that running a PCA on them would probably give different eigenvalues for the principal components. I'm not talking about an extreme difference or a strong pattern; I'd just expect there to be SOME difference. I find it weird that my data would be perfectly balanced among all four PCs when I look at the variance of these surrogate variable values. Or have I missed something fundamental here?
I'd be very grateful if someone could shed light on this for me.