When running PCA with GCTA, why are there the same number of eigenvalues as the number of samples?
2.7 years ago • ? ▴ 60

After I got the filtered VCF file of SNPs with the GATK pipeline, I tried to run PCA with GCTA. Afterwards, I tried to find the percentage of variation explained by each PC (principal component). What confuses me is that the number of eigenvalues I get is always the same as the sample size. I thought that the number of possible PCs was the same as the dimension of the variables (the number of SNPs called, about 17,000,000 in this case), and that there should be the same number of eigenvalues. Of course some eigenvalues could be equal, and PCs that explain little variation are useless, but isn't it possible that there could be 17,000,000 PCs and eigenvalues? So I thought that to get the percentage of variation explained by PC1, I had to divide its eigenvalue by the sum of 17,000,000 eigenvalues. Could someone explain why this is wrong, and why I always get the same number of eigenvalues as the sample size?

GCTA snp PCA • 1.5k views
2.7 years ago

I thought that the number of possible PCs was the same as the dimension of the variables

If you think about it: if you have 2 points in a 3-dimensional space, you can rotate and shift the original axes in such a way that one axis passes exactly through the 2 datapoints. Therefore you don't need three principal components; with just 1 PC you can describe the 2 datapoints. In fact, if you have more dimensions than datapoints, the number of PCs is at most the number of datapoints - 1. See also https://stats.stackexchange.com/questions/123318/why-are-there-only-n-1-principal-components-for-n-data-if-the-number-of-dime .
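Here is a minimal R sketch of that picture (the two points are arbitrary; prcomp is just used to expose the standard deviations of the PCs):

    # Two datapoints in 3 dimensions: after centering, both points lie on a
    # single line, so one rotated axis captures all of the variance.
    X <- rbind(c(1, 2, 3),
               c(4, 0, 1))

    pc <- prcomp(X, center = TRUE)
    pc$sdev  # first value is nonzero, second is (numerically) zero:
             # a single PC describes the 2 points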

However, R's prcomp will return as many PCs as datapoints, with the last PC having a standard deviation very close to 0. I'm not sure if this is because of the underlying linear algebra of the SVD or because of floating-point imprecision (I think the latter).
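For instance, with a toy matrix standing in for a genotype matrix (made-up dimensions, random values):

    set.seed(1)
    n <- 5    # samples
    p <- 100  # variables (e.g. SNPs), many more than samples
    X <- matrix(rnorm(n * p), nrow = n)

    pc <- prcomp(X, center = TRUE)
    length(pc$sdev)  # 5: one component per sample, not one per variable
    pc$sdev          # and the 5th standard deviation is ~0

    # Percent variance explained: divide each eigenvalue (sdev^2) by the
    # sum of the n returned eigenvalues, not by a sum over all p variables.
    pve <- 100 * pc$sdev^2 / sum(pc$sdev^2)
    round(pve, 1)

This also answers the question about percentages: the denominator is the sum of the eigenvalues you actually get back (one per sample), not a sum over 17,000,000 SNPs.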


Thank you for the intuitive explanation. As I thought about it more, I found that the number of eigenvalues is the same as the dimension of the covariance matrix, and the dimension of the covariance matrix is the same as the number of rows or columns of the data matrix. So when the sample size is smaller than the variable dimension, the maximum number of eigenvalues is the same as the sample size, and when the sample size is bigger, the opposite. Am I right?

2.7 years ago • Lemire ▴ 940

The principal components are the eigenvectors of the covariance matrix (or correlation matrix, if you scale) of your data. Now you have two choices: do you want to focus on the covariance between your N samples, or the covariance between your M SNPs? GCTA focuses on the former, whereas you thought it was the latter. GCTA's covariance matrix (a.k.a. the GRM) is N x N, hence you get N eigenvectors. You are not reducing the "SNP dimensions"; you are reducing the "sample dimensions".
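To make the contrast concrete, here is a toy R sketch (standardizing genotypes and taking the cross-product only approximates GCTA's actual GRM formula):

    # Toy genotype matrix: N = 5 samples (rows), M = 50 SNPs (columns).
    set.seed(42)
    N <- 5
    M <- 50
    G <- matrix(rbinom(N * M, size = 2, prob = 0.3), nrow = N)
    G <- G[, apply(G, 2, var) > 0]  # drop monomorphic "SNPs"

    Gs <- scale(G)  # standardize each SNP

    # Covariance between samples (the GRM-like choice): an N x N matrix.
    sample_cov <- tcrossprod(Gs) / ncol(Gs)   # Gs %*% t(Gs) / M
    length(eigen(sample_cov)$values)          # N = 5 eigenvalues/eigenvectors

    # Covariance between SNPs (what you had in mind): an M x M matrix.
    snp_cov <- crossprod(Gs) / (N - 1)        # t(Gs) %*% Gs / (N - 1)
    length(eigen(snp_cov)$values)             # M eigenvalues, but at most
                                              # N - 1 of them are nonzero

Both matrices carry the same information about the samples; the M x M version just pads it with zero eigenvalues.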

