R SNPRelate and PLINK PCA eigenvectors negatives of each other
1
0
Entering edit mode
4.1 years ago
dumeir ▴ 20

Hi,

I'm using PLINK (version 1.9) and encountered a memory error while using the --pca command, so I decided to use SNPRelate in RStudio instead.

To test it out, I compared the eigenvectors and eigenvalues I got from a PLINK PCA of a different dataset to the results of the SNPRelate PCA. The values are pretty similar, but they just seem to be of different signs? I'm not sure why this occurred and how it can be resolved, and which eigenvectors are the correct ones.

Note: file names and sample ids changed.

PLINK

PLINK v1.90b6.15 64-bit (21 Jan 2020)

Options in effect:
  --bfile data_pruned
  --out data_PC
  --pca 10

Random number seed: 1583390967
8192 MB RAM detected; reserving 4096 MB for main workspace.
352395 variants loaded from .bim file.
491 people (0 males, 491 females) loaded from .fam.
491 phenotype values loaded from .fam.
Using up to 4 threads (change this with --threads).
Before main variant filters, 491 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.998232.
352395 variants and 491 people pass filters and QC.
Among remaining phenotypes, 204 are cases and 287 are controls.
Excluding 7715 variants on non-autosomes from relationship matrix calc.
Relationship matrix calculation complete.
--pca: Results saved to data_PC.eigenval and data_PC.eigenvec .

data_PC.eigenvec

sample_1 -3.28359e-05 0.0472202 -0.033297

sample_2 0.0185079 -0.123125 -0.048018

sample_3 0.00786112 -0.066443 -0.0107746

sample_4 -0.00904159 -0.0490187 0.0346168

sample_5 0.00430433 0.0795848 -0.0236887

SNPRelate

#test using data
bed1 <- "data_pruned.bed"
fam1 <- "data_pruned.fam"
bim1 <- "data_pruned.bim"
snpgdsBED2GDS(bed1,fam1,bim1,"data.gds")
genofile <- snpgdsOpen("data.gds")
pca1 <- snpgdsPCA(genofile)
df <- data.framesample.id=pca1$sample.id,
                 EV1 = pca1$eigenvect[,1],
                 EV2 = pca1$eigenvect[,2],
                 EV3 = pca1$eigenvect[,3],
                 stringsAsFactors= FALSE)

.

 > snpgdsPCA(genofile)
Principal Component Analysis (PCA) on genotypes:
Excluding 7,749 SNPs on non-autosomes
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
Working space: 491 samples, 344,646 SNPs
    using 1 (CPU) core
PCA:    the sum of all selected genotypes (0,1,2) = 87296725
CPU capabilities: Double-Precision SSE2
Sat Apr  4 23:52:15 2020    (internal increment: 724)
[==================================================] 100%, completed, 19s 
Sat Apr  4 23:52:34 2020    Begin (eigenvalues and eigenvectors)
Sat Apr  4 23:52:34 2020    Done.

.

head(df)

sample.id EV1 EV2 EV3

1 sample_1 2.202874e-05 -0.04652531 0.03281472

2 sample_2 -1.801974e-02 0.11138703 0.04055612

3 sample_3 -7.873690e-03 0.06720024 0.01055124

4 sample_4 9.048831e-03 0.04950324 -0.03596400

5 sample_5 -4.337900e-03 -0.08086678 0.02444902

R SNPRelate GWAS PLINK Bioconductor • 1.4k views
ADD COMMENT
0
Entering edit mode

Please don't delete posts once they have received a comment/answer.

ADD REPLY
0
Entering edit mode
4.1 years ago
  1. The sign of the PCs is meaningless.
  2. When you have too many samples for basic --pca, plink 2.0's --pca approx usually works.
ADD COMMENT

Login before adding your answer.

Traffic: 2339 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6