How are principal components generated directly from eigendecomposition of a 'sample' covariance matrix?

I have lcWGS SNP data for about 100 samples that I would like to plot in a PCA to show structure among the samples based on their genetic diversity.

I am using ANGSD to generate genotype likelihoods and then PCAngsd to generate a covariance matrix and finally a PCA.

However, I'm facing some confusion with how the PCA is generated.

PCA is commonly explained by generating a p x p dimension covariance matrix 𝐂 from a centered data matrix 𝐗 that is n x p (n=samples, p=variables), by doing:

𝐂=𝐗⊤𝐗/(𝑛−1).

One can then perform eigendecomposition of 𝐂 to obtain eigenvectors and eigenvalues, and project the original matrix 𝐗 onto the eigenvectors (i.e. compute 𝐗𝐕) to obtain the principal components (although I appreciate that some people refer to the eigenvectors themselves as the principal components). E.g. see this excellent explanation.
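For reference, here is that textbook recipe sketched in NumPy on synthetic data (my own illustration, not PCAngsd code; variable names are mine):

```python
# Textbook PCA via the p x p covariance matrix (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                      # n samples, p variables
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)             # centre each variable (column)

C = X.T @ X / (n - 1)              # p x p covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]  # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

PCs = X @ eigvecs                  # n x p: samples in eigenvector space
```

Here the variance of each PC column equals the corresponding eigenvalue, which is the usual sanity check for this construction.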

However, in the PCAngsd tutorial, the covariance matrix that is generated is of dimension n x n rather than p x p, and the PCA plot is made by directly plotting the eigenvectors from an eigendecomposition of that matrix, with code like:

## Open R

e <- eigen(cov)           # eigendecomposition of the n x n covariance matrix

plot(e$vectors[, 1:2])    # plot samples on the first two eigenvectors


I'm sure this is correct, but I'm having trouble understanding how eigendecomposition of this n x n 'sample' covariance matrix can directly give the principal components of the samples (i.e. the coordinates of the samples in eigenvector space). How is this possible? I've read through the PCAngsd paper and some of the foundational work (e.g. Patterson 2006), and see that their covariance matrix is generated as

𝐂=𝐗𝐗⊤/(𝑛−1).

but I can't work out why having an n x n covariance matrix enables the generation of PCs directly from eigendecomposition (perhaps due to my lack of mathematical fluency).
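To make the puzzle concrete, here is a small NumPy check on synthetic data (my own sketch, not from either paper) confirming that the two routes do give matching sample coordinates, up to scale and sign — this is exactly the relationship I can't explain:

```python
# Compare: (a) textbook PCs from the p x p matrix X'X vs
#          (b) eigenvectors of the n x n matrix XX' (PCAngsd-style plot).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500                    # n samples, p SNP-like variables
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)             # centred data matrix

# (a) textbook route: eigenvectors of X'X, then project the samples
wp, Vp = np.linalg.eigh(X.T @ X / (n - 1))
pcs = X @ Vp[:, ::-1][:, :2]       # top-2 PCs, shape (n, 2)

# (b) eigenvectors of the n x n sample matrix, plotted directly
wn, Vn = np.linalg.eigh(X @ X.T / (n - 1))
u = Vn[:, ::-1][:, :2]             # top-2 eigenvectors, shape (n, 2)

# Each column of u is proportional (up to sign) to the matching PC
for k in range(2):
    r = np.corrcoef(pcs[:, k], u[:, k])[0, 1]
    print(abs(r))                  # essentially 1.0 for each k
```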

Does anybody have an intuitive explanation?

ANGSD PCA PCAngsd