Question

What is the relationship between PLINK --pca and --mds-plot in terms of % variance explained

0

Entering edit mode

4.1 years ago

Luther_Blisset ▴ 30

Hi,

I am trying to get the correct percent of variance explained by a PLINK PCA analysis (--mds-plot).

When I run the code with a parameter of 10 PCAs...

plink --vcf <vcf_file.vcf.gz> --mds-plot 10 eigendecomp --cluster --double-id --allow-extra-chr --out <out_file.pca>

...as expected I get ten columns (C1 - C10).

Then when I run the separate code to get the eigenvalues...

 plink --vcf <vcf_file.vcf.gz> --pca --double-id --allow-extra-chr --out <out_file.pca>

...the default returns a list of 20 eigenvalues.

When I re-run the code this time specifying only 10 eigen values...

 plink --vcf <vcf_file.vcf.gz> --pca 10 --double-id --allow-extra-chr --out <out_file.pca>

I get a list of 10 eigen values.

Checking the differences in the two files however, I notce that the second is just a truncated list of the first:

and

Therefore, am I correct in saying that I need all of the eigenvalues (all 20) from the first eigenvalues file to make sure I have the correct % of variance explained, even though --mds-plot only outputs 10?

In other words, is the formula for the correct variance explained (in R code):

eigs <- fread(<eigenvals_file>)
var_explained <- (eigs/sum(eigs))*100

or

eigs <- fread(<eigenvals_file>)
var_explained <- (eigs[1:10]/sum(eigs[1:10]))*100

Thanks

plink pca eigenvalues mds • 2.5k views

ADD COMMENT • link updated 4.1 years ago by h.mon 35k • written 4.1 years ago by Luther_Blisset ▴ 30

3

Entering edit mode

You need ALL eigenvalues, not just the first 20. The default is to output just 20, but there are actually as many eigenvalues as you have samples. Count your number of samples, and pass that as argument. In practice, at some point the eigenvalues get really small, but depending on your sample size these can add up quickly.

ADD REPLY • link 4.1 years ago by Lemire ▴ 940

0

Entering edit mode

I see, thank you. So when I call --mds-plot 10, does this return a compression of the data into just 10 components, or is it outputting a truncated list of as many components as there are variables (as you would expect from a conventional PCA)?

In other words is --mds-plot 10 doing the analagous analysis to --pca 10, and just truncating the output, or is the analysis actually just using 10 components? Hope that is clear.

This is important because if --mds-plot 10 is not just truncating then surely the outputs won't match up. (The --mds-plot analysis has only used 10 components, so the eigenvals will be different to those used in --pca 10). Is that right?

ADD REPLY • link 4.1 years ago by Luther_Blisset ▴ 30

1

Entering edit mode

MDS, just like PCA, has as many dimensions as there are variables. I don't know the specifics of the implementation, but my guess is if you specify 10, it will likely just give you the first 10 dimensions. This means that if you ask for 10, and then 20 in a second run, the first 10 should be the same in both runs.

ADD REPLY • link 4.1 years ago by Lemire ▴ 940