7 months ago
Ahmed
I am getting completely different results when I run PCA in PLINK and in Hail. Does anyone know why? By "different" I mean:
- The Pearson correlation between the top 10 PCs from the two tools is essentially 0
- When I create a PCA scatter plot, I get completely different-looking clusters, suggesting different population stratification
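For anyone wanting to reproduce the comparison, here is a minimal sketch of correlating PCs between two tools. It assumes the scores from each tool have already been loaded into dicts keyed by sample ID; the names `pearson` and `pc_correlations` and the toy data are hypothetical, not part of Hail or PLINK. It matches samples by ID rather than by row order and takes the absolute value of the correlation, since the sign of a PC is arbitrary and can flip between implementations.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / (sd_x * sd_y)

def pc_correlations(scores_a, scores_b):
    """Return |r| for PC1, PC2, ... over the samples present in both inputs.

    scores_a, scores_b: dict mapping sample ID -> list of PC scores
    (hypothetical in-memory format, not a Hail or PLINK structure).
    """
    shared = sorted(set(scores_a) & set(scores_b))
    if len(shared) < 3:
        raise ValueError("need at least 3 shared samples")
    k = min(len(scores_a[shared[0]]), len(scores_b[shared[0]]))
    return [
        abs(pearson([scores_a[s][i] for s in shared],
                    [scores_b[s][i] for s in shared]))
        for i in range(k)
    ]

# Toy data: tool B reports PC1 with flipped sign and a different scale,
# and lists the samples in a different order.
tool_a = {"s1": [1.0, 0.5], "s2": [2.0, -0.5], "s3": [3.0, 0.2], "s4": [4.0, 0.1]}
tool_b = {"s4": [-8.0, 0.1], "s2": [-4.0, -0.5], "s1": [-2.0, 0.5], "s3": [-6.0, 0.2]}
print(pc_correlations(tool_a, tool_b))
```

If the per-PC values stay near 0 even after matching on sample ID and ignoring sign, the difference is unlikely to be a sample-ordering or sign artifact.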
Points to note:
- It's the same set of samples and SNPs (I am using the same .bed/.bim/.fam files)
- I did QC on the dataset beforehand (including LD pruning, MAF > 0.05, genotyping rate > 0.95). According to the Hail output, none of the SNPs are being removed (the number of SNPs left after filtering matches the number in my .bim file)
- When I use another package (bigsnpr), I get clusters close to what I get in Hail.
My commands are as follows:
Hail v0.2
hl.import_plink(bed='file.bed', bim='file.bim', fam='file.fam', reference_genome='GRCh38').write('file.mt', overwrite=True)
samples = hl.read_matrix_table('file.mt')
pca_evals_s, pca_scores_s, pca_loadings_s = hl.hwe_normalized_pca(samples.GT, k=10, compute_loadings=True)
PLINK 2.0
plink2.0 --bfile file --pca 10 --out plink_pca --threads 14
EDIT
The issue only happens with PLINK 2.0, not with PLINK 1.9.
Thank you!
If you run
plink2 --version
what is the result?