Hi there,
I'm doing my first experiments with PCA and UMAP as dimensionality reduction methods to visualize a dataset I've been working on. Basically, I took the samples from the SGDP, mapped them to the human pangenome and, finally, called small variants with DeepVariant.
I moved on to some PopGen analyses and, as a preliminary inspection of the groups in this panel, I'm running a PCA with Plink2. Starting from the joint callset for these ~300 samples, I removed genomic regions that could be troublesome, e.g. repeats, centromeres and satellites, low-mappability regions, and segmental duplications (SDs). I then attempted my first PCA but, for some reason, the samples are smeared all over the plot... (see figure below)
Searching around, I found this old but very useful post on how things should have been done: I should have removed INDELs and kept only bi-allelic SNPs. So my next step was to run the following on my VCF file:
bcftools norm -m+ $VCF | bcftools view -m2 -M2 -v snps -Oz -o $new_file_name
However, the result didn't change significantly. The smearing persists and there are no well-defined clusters/groups in the plot...
For reference, these are the Plink2 commands I'm using to generate the eigenvec and eigenval files for plotting:
./plink2 --vcf $VCF --set-missing-var-ids @:#:\$r:\$a --rm-dup --indep-pairwise 200kb 0.5 --not-chr X,Y,MT --vcf-half-call m --out SGDP_snps_bi_norm
./plink2 --vcf $VCF --set-missing-var-ids @:#:\$r:\$a --not-chr X,Y,MT --vcf-half-call m --maf 0.05 --extract SGDP_snps_bi_norm.prune.in --make-pgen --pca --out SGDP_snps_bi_norm
I double-checked these commands with the author of the tool. I'm kind of lost on what's going wrong; if anyone has more experience with this type of analysis, any help is much appreciated. Thanks in advance!
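For anyone who wants to reproduce the plot, here is a minimal Python sketch of how the two output files could be read. It assumes the usual PLINK 2 layout: a whitespace-delimited `.eigenvec` file whose header starts with `#FID` or `#IID` followed by `PC1`, `PC2`, ..., and an `.eigenval` file with one eigenvalue per line. The function names are hypothetical, not part of any tool.

```python
def parse_eigenvec(lines):
    """Parse .eigenvec lines into {sample_id: [PC1, PC2, ...]}.

    Assumes a header line such as '#IID PC1 PC2 ...' or '#FID IID PC1 ...'.
    """
    rows = [line.split() for line in lines if line.strip()]
    header = rows[0]
    header[0] = header[0].lstrip("#")  # '#IID' -> 'IID', '#FID' -> 'FID'
    iid_col = header.index("IID")
    pc_cols = [i for i, name in enumerate(header) if name.startswith("PC")]
    return {row[iid_col]: [float(row[i]) for i in pc_cols] for row in rows[1:]}


def variance_share(eigenvalues):
    """Share of each eigenvalue relative to the sum of the ones provided."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]
```

With the output prefix used above, `parse_eigenvec(open("SGDP_snps_bi_norm.eigenvec"))` would give the per-sample coordinates. Note that `variance_share` is relative to the PCs that were actually computed (10 by default with `--pca`), not to the total genetic variance, so the percentages on the axes should be read with that in mind.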
What is the actual goal of your PCA? What do you want to visualize? Is it just the ancestry of your samples?
DBScan, not so much ancestry, but rather how the populations in this dataset cluster based on the reported place of origin of the sequenced samples. In theory, individuals from the same place who belong to the same population in the dataset should stick together in the plot, reflecting their greater genetic similarity.
The PCA plot looks good in theory; maybe you just made an error in assigning the populations to the samples? Basically, the samples on the right side should belong to AFR, and the ones at the bottom should be EUR.
That was my first thought too, even before looking for solutions here; however, I cross-referenced the metadata of the dataset multiple times and there are no errors in the population-to-sample assignment...
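A cross-check like this can also be scripted, which helps rule out subtle ID mismatches (for example, sample IDs written slightly differently in the VCF and in the metadata). The sketch below is only an illustration: it assumes a hypothetical two-column, tab-separated metadata file (sample ID, population) and takes the list of sample IDs found in the `.eigenvec` file. The function names are made up for this example.

```python
from collections import Counter


def parse_pop_map(lines):
    """Build {sample_id: population} from 'sample<TAB>population' lines.

    The two-column layout is an assumption; adapt it to the real metadata.
    """
    pop_map = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0]:
            pop_map[fields[0].strip()] = fields[1].strip()
    return pop_map


def check_labels(sample_ids, pop_map):
    """Report PCA samples without a label, labels without a PCA sample,
    and sample IDs that occur more than once in the PCA output."""
    counts = Counter(sample_ids)
    return {
        "missing_label": sorted(s for s in counts if s not in pop_map),
        "not_in_pca": sorted(s for s in pop_map if s not in counts),
        "duplicated": sorted(s for s, n in counts.items() if n > 1),
    }
```

If all three lists come back empty, the IDs at least match one to one. It does not prove that each label is correct, only that no sample is silently unlabeled or dropped when the metadata is joined to the PCA coordinates.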
Maybe try a different program for PCA then? For a quick ancestry classification, you can use somalier: https://github.com/brentp/somalier