When running PCA with GCTA, why are there the same number of eigenvalues as the number of samples?
2.7 years ago • ? ▴ 60

After I got the filtered VCF file of SNPs with the GATK pipeline, I tried to run PCA with GCTA. Afterwards, I tried to find the percentage of variation explained by each PC (principal component). What confuses me is that the number of eigenvalues I get is always the same as the sample size. I thought that the number of possible PCs was the same as the dimension of the variables (the number of SNPs called, about 17,000,000 in this case), and that there should be the same number of eigenvalues. Of course some eigenvalues could be equal, and PCs that explain little variation are useless, but isn't it possible that there could be 17,000,000 PCs and eigenvalues? So I thought that to get the percentage of variation explained by PC1, I had to divide its eigenvalue by the sum of 17,000,000 eigenvalues. Could someone explain why this is wrong, and why I always get the same number of eigenvalues as the sample size?

GCTA snp PCA • 1.5k views
2.7 years ago

I thought that the number of possible PCs was the same as the dimension of the variables

If you think about it: if you have 2 points in a 3-dimensional space, you can rotate and shift the original axes in such a way that one axis passes exactly through the 2 datapoints. Therefore you don't need three principal components; with just 1 PC you can describe the 2 datapoints. In fact, if you have more dimensions than datapoints, the number of PCs is at most the number of datapoints - 1. See also https://stats.stackexchange.com/questions/123318/why-are-there-only-n-1-principal-components-for-n-data-if-the-number-of-dime .
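Here is a minimal R sketch of that picture (the two points are arbitrary; prcomp is just used to expose the standard deviations of the PCs):

    # Two datapoints in 3 dimensions: after centering, both points lie on a
    # single line, so one rotated axis captures all of the variance.
    X <- rbind(c(1, 2, 3),
               c(4, 0, 1))

    pc <- prcomp(X, center = TRUE)
    pc$sdev  # first value is nonzero, second is (numerically) zero:
             # a single PC describes the 2 points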

However, R's prcomp will return as many PCs as datapoints, with the last PC having a standard deviation very close to 0. I'm not sure if this is because of the underlying linear algebra of the SVD or because of floating-point imprecision (I think the latter).
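For instance, with a toy matrix standing in for a genotype matrix (made-up dimensions, random values):

    set.seed(1)
    n <- 5    # samples
    p <- 100  # variables (e.g. SNPs), many more than samples
    X <- matrix(rnorm(n * p), nrow = n)

    pc <- prcomp(X, center = TRUE)
    length(pc$sdev)  # 5: one component per sample, not one per variable
    pc$sdev          # and the 5th standard deviation is ~0

    # Percent variance explained: divide each eigenvalue (sdev^2) by the
    # sum of the n returned eigenvalues, not by a sum over all p variables.
    pve <- 100 * pc$sdev^2 / sum(pc$sdev^2)
    round(pve, 1)

This also answers the question about percentages: the denominator is the sum of the eigenvalues you actually get back (one per sample), not a sum over 17,000,000 SNPs.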


Thank you for the intuitive explanation. As I thought about it more, I found that the number of eigenvalues is the same as the dimension of the covariance matrix, and the dimension of the covariance matrix is the same as the number of rows or columns of the data matrix. So when the sample size is smaller than the variable dimension, the maximum number of eigenvalues is the same as the sample size, and when the sample size is bigger, the opposite. Am I right?

2.7 years ago • Lemire ▴ 940

The principal components are the eigenvectors of the covariance matrix (or correlation matrix, if you scale) of your data. Now you have two choices: do you want to focus on the covariance between your N samples, or the covariance between your M SNPs? GCTA focuses on the former, whereas you thought it was the latter. GCTA's covariance matrix (a.k.a. the GRM) is N x N, hence you get N eigenvectors. You are not reducing the "SNP dimensions"; you are reducing the "sample dimensions".
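To make the contrast concrete, here is a toy R sketch (standardizing genotypes and taking the cross-product only approximates GCTA's actual GRM formula):

    # Toy genotype matrix: N = 5 samples (rows), M = 50 SNPs (columns).
    set.seed(42)
    N <- 5
    M <- 50
    G <- matrix(rbinom(N * M, size = 2, prob = 0.3), nrow = N)
    G <- G[, apply(G, 2, var) > 0]  # drop monomorphic "SNPs"

    Gs <- scale(G)  # standardize each SNP

    # Covariance between samples (the GRM-like choice): an N x N matrix.
    sample_cov <- tcrossprod(Gs) / ncol(Gs)   # Gs %*% t(Gs) / M
    length(eigen(sample_cov)$values)          # N = 5 eigenvalues/eigenvectors

    # Covariance between SNPs (what you had in mind): an M x M matrix.
    snp_cov <- crossprod(Gs) / (N - 1)        # t(Gs) %*% Gs / (N - 1)
    length(eigen(snp_cov)$values)             # M eigenvalues, but at most
                                              # N - 1 of them are nonzero

Both matrices carry the same information about the samples; the M x M version just pads it with zero eigenvalues.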

