I've been trying to dig into whole-genome sequencing data, specifically the existing data in the 1000 Genomes Project and gnomAD, and I have a lot of questions. Does gnomAD contain the 1000 Genomes Project samples?
I've found many VCF files, including these:
(Also, what is the difference between the gnomAD variant and callset files, and why are both so huge?)
Are the really huge VCFs generated from whole-genome sequencing, whereas the smaller ones come from chip arrays?
I'd also like to use this data as a reference to compare a sample against. The approach I'm most familiar with is running PCA, then UMAP on the principal components, then clustering to see where the sample lies. I found this implementation https://github.com/diazale/umap_review/blob/master/code/umap_dev_experimentation.ipynb , but it seems to skip lines and, I think, only uses the chip-array-sized file.
I've seen that PLINK can run PCA on many samples. Is there a way to run PLINK on each of the huge per-chromosome callset files from gnomAD to get the PCA, and then use that output to generate the UMAP clustering? I haven't been able to figure out how to run PCA in PLINK, or how to combine multiple PCA results from it.
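To make the goal concrete: my understanding (which may be wrong) is that PCA results from separate chromosomes can't simply be stitched together, so the PCA would have to be computed once on genotypes from all chromosomes combined. Below is a minimal sketch of that idea on toy data, assuming NumPy and made-up genotype matrices. The function name `pca_scores` and the random matrices are placeholders of mine, not anything from PLINK or gnomAD.

```python
import numpy as np

def pca_scores(genotypes, n_components=2):
    """Project samples onto the top principal components.

    genotypes: (n_samples, n_variants) array of alt-allele counts (0, 1, 2).
    """
    # Center each variant (column) so PCA captures variation between samples.
    centered = genotypes - genotypes.mean(axis=0)
    # SVD of the centered matrix; columns of u scaled by s are the PC scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy stand-ins for per-chromosome genotype matrices.
# Assumption: both matrices hold the same samples in the same row order.
rng = np.random.default_rng(0)
chr1 = rng.integers(0, 3, size=(6, 50))
chr2 = rng.integers(0, 3, size=(6, 40))

# Concatenate along the variant axis, then run a single PCA over everything.
combined = np.concatenate([chr1, chr2], axis=1)
scores = pca_scores(combined, n_components=2)
print(scores.shape)  # (6, 2)
```

The `scores` array (one row per sample) is what I would then feed into UMAP and clustering.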
Lastly, is there an easy way to merge callsets? It's infeasible to redo the variant calling from the gnomAD data, but would it be possible to add my own VCF to the existing callsets, restricted to SNPs that are already present in them? I foresee this not losing much data, since the gnomAD callsets have plenty of SNPs that will probably match up, and the discordant SNPs can be discarded. Is there a program that does this?
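To illustrate the kind of filtering I have in mind, here is a minimal sketch in plain Python on toy records. The dictionary layout and the function names `variant_key` and `keep_shared_variants` are hypothetical, just to show the logic: keep a SNP from my sample only if the same chromosome, position, reference allele, and alternate allele exist in the reference callset.

```python
def variant_key(record):
    """Identify a variant by chromosome, position, and alleles."""
    return (record["chrom"], record["pos"], record["ref"], record["alt"])

def keep_shared_variants(sample_records, reference_records):
    """Return sample records whose exact variant also exists in the reference."""
    reference_keys = {variant_key(r) for r in reference_records}
    return [r for r in sample_records if variant_key(r) in reference_keys]

# Toy reference callset (stand-in for sites in an existing callset).
reference = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": "G"},
    {"chrom": "1", "pos": 2000, "ref": "C", "alt": "T"},
]
# Toy sample: one match, one discordant allele, one site absent from the reference.
sample = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": "G"},
    {"chrom": "1", "pos": 2000, "ref": "C", "alt": "A"},
    {"chrom": "1", "pos": 3000, "ref": "G", "alt": "C"},
]

shared = keep_shared_variants(sample, reference)
print([r["pos"] for r in shared])  # [1000]
```

Obviously this wouldn't scale to gnomAD-sized files as written, which is why I'm hoping an existing tool already does the equivalent.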