I have a massive data table with dbSNP rs IDs as rows and samples as columns, in this kind of format:
dbSNP       Sample Sample Sample Sample Sample Sample
rs10000011  CC     CC     CC     CC     TC     TC
rs1000002   TC     TT     CC     TT     TT     TC
rs10000023  TG     TG     TT     TG     TG     TG
rs1000003   AA     AG     AG     AA     AA     AG
rs10000041  TT     TG     TT     TT     TG     GG
rs10000046  GG     GG     AG     GG     GG     GG
rs10000057  AA     AG     GG     AA     AA     AA
rs10000073  TC     TT     TT     TT     TT     TT
rs10000092  TC     TC     CC     TC     TT     TT
There are over 1,000 samples and >547,000 loci in this table, from an HGDP dataset (ftp://ftp.cephb.fr/hgdp_supp10/), and I would like to run a large Principal Component Analysis (with samples colored by population).
To do that, I first need to code my genotypes numerically. How would I go about this (preferably in R, as the file is probably too big for JMP Genomics)?
Also, some spots lack data, indicated by --- or 00. I am going to standardize those to NA with a find-and-replace script, but how do I code the genotypes so that R will still be able to run the PCA? Thanks!
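A minimal sketch in R of one common coding scheme: counting copies of the minor allele at each locus (0/1/2). The toy matrix below mimics the layout above; the function and object names (geno_to_numeric and so on) are my own, and the per-locus mean imputation before prcomp() is just one simple way to deal with the NAs, not the only one:

```r
# Recode one row (locus) of genotype strings like "CC"/"TC"/"TT"
# into counts of the minor allele (0/1/2); NAs stay NA.
geno_to_numeric <- function(geno_row) {
  # Pool all alleles at this locus and pick the less frequent one
  alleles <- unlist(strsplit(na.omit(geno_row), ""))
  minor <- names(which.min(table(alleles)))
  # Count how many copies of the minor allele each genotype carries
  sapply(geno_row, function(g) {
    if (is.na(g)) NA_integer_ else sum(strsplit(g, "")[[1]] == minor)
  }, USE.NAMES = FALSE)
}

# Toy data in the same layout as the question (loci x samples)
geno <- rbind(
  rs1000002  = c("TC", "TT", "CC", "TT", "TT", "TC"),
  rs10000057 = c("AA", "AG", "GG", "AA", "AA", "AA")
)
num <- t(apply(geno, 1, geno_to_numeric))  # numeric loci x samples

# prcomp() cannot take NAs: here, impute each locus with its mean
num_t <- t(num)  # samples as rows for PCA on samples
for (j in seq_len(ncol(num_t))) {
  miss <- is.na(num_t[, j])
  num_t[miss, j] <- mean(num_t[, j], na.rm = TRUE)
}

# Drop zero-variance (monomorphic) loci before scale. = TRUE
pca <- prcomp(num_t, scale. = TRUE)
# plot(pca$x[, 1:2], col = pop)  # 'pop' would be your population factor
```

At HGDP scale the apply() loop over 547k loci will be slow but workable on an HPC node; the same 0/1/2 matrix is also what dedicated packages expect.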
I am not sure R can easily handle a dataset this big either. I would suggest using PLINK (which computes the PCA directly), though you will need to create some extra files to describe your data. See https://www.cog-genomics.org/plink2/input and https://www.cog-genomics.org/plink2/strat#pca.
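For the PLINK route, the invocation looks roughly like this. The file prefix hgdp is hypothetical; you would first have to build a .ped/.map pair from your genotype table as described on the input-format page linked above:

```shell
# Hypothetical prefix 'hgdp': assumes hgdp.ped and hgdp.map exist
plink --file hgdp --make-bed --out hgdp         # convert to binary .bed/.bim/.fam
plink --bfile hgdp --pca 10 --out hgdp_pca      # writes hgdp_pca.eigenvec / .eigenval
```

The .eigenvec file can then be read back into R for plotting, with samples colored by population.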
I can run R on UF's HPC cluster, though; it should be able to handle it there.
Anyone have any suggestions? I tried Stack Overflow, but they sent me back here.