Need good way to run a Principle Component Analysis on SNP data
2
0
Entering edit mode
9.3 years ago
devenvyas ▴ 760

I am bit of a totally lost situation, I get one thing right, but then two things go wrong (sorry if some of these comments are repeats of stuff from previous threads). I need some help/advice on how to proceed.

Just for context, this dataset is meant for two things (for now), which are to 1) see how these samples related to various global populations using analyses such as PCA and STRUCTURE (or similar) and 2) estimate archaic introgression (i.e., Neanderthals and Denisovans)

I have 92 samples that were genotyped on the Affy Human Origins array from about 570,000 SNPs. I also have comparative data from 934 HGDP samples that I have downloaded, but there is just so much of it that I need to cull it down (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.map.gz ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.ped.gz).

For the moment being, I've been trying to just PCA my own samples (n=92), but this isn't working either

I coded my genotypes to 0/1/2s using JMP Genomics and then did this using R on UF's computer cluster

read.table('92simpleb_recgeno.txt', sep='\t', header=TRUE, row.names=1)->table
pcol=c(rep("green",3),rep("blue",89))
help(prcomp)
table[ table == "." ] = NA
t(table) -> trans
pca<-prcomp(~ ., data=trans, na.action = na.omit)
save(pca, file="1.RData")

and I got this back

Error: cannot allocate vector of size 1284.9 Gb
Execution halted

Basically, I am lost (and a bit frustrated), and I need some suggestions on how to PCA this much data without metaphorically blowing something up. Currently, the samples are formatted with columns as samples and rows as loci, but I can re-export the data from Affy Genotyping Console in a different format. Any suggestions on what I should do.

pca SNP • 4.2k views
ADD COMMENT
0
Entering edit mode

R cannot handle a matrix of that size. If you are working on Snp data, maybe you can try EIGENSTRAT under EIGENSOFT.

ADD REPLY
0
Entering edit mode

You could also use PLINK: https://www.cog-genomics.org/plink2/strat#pca. PLINK's pca only uses a subset of samples making the computation feasible and (somehow) fast.

ADD REPLY
0
Entering edit mode

I am about to try an MDS on Plink 1.07. Do you happen to know if the application support multiple threads/processors?

ADD REPLY
0
Entering edit mode

Not sure whether PLINK 1.07 supports mutithreading, but I would say no. Why don't you use PLINK 1.9? It supports threads and it is more optimised (in general).

For PLINK 1.9: https://www.cog-genomics.org/plink2/, for threads: https://www.cog-genomics.org/plink2/other#threads

ADD REPLY
0
Entering edit mode

Currently, UF's cluster does not have Plink 1.9 installed (just 1.07). I am going to ask them to add it, but that will take a day or two (thus I am using 1.07 for now).

ADD REPLY
0
Entering edit mode

I think I am going to take alesssia's advise and finally give Plink a try for now as my gigantic comparative data is already in Plink format, but I had a question about EIGENSTRAT. Eventually, I am possibly going to need to use the EIGENSTRAT file format for Admixtools. Does the package have a method of converting Plink files to EIGENSTRAT?

ADD REPLY
0
Entering edit mode
ADD REPLY
1
Entering edit mode
9.2 years ago
jamesxli2007 ▴ 40

It looks like your R program is trying to build a 570K x 570K covariance matrix, which is huge. A better way to do PCA for such wide matrix is to pass the transposed the 570000x92 matrix to the PCA algorithm, which will calculate the eigenvectors based on the 92x92 covariance matrix. You can then multiply the resulting 92x92 eigenvector matrix with your original 570000x92 matrix to get eigenvectors for your original data table.

ADD COMMENT
0
Entering edit mode

Well spotted! If you want to see if the 92 samples tend to group in clusters you can do an xy-plot of the first two principal components obtained from the matrix t(570000x92), which is not too computationally expensive right? Then you say "You can then multiply the resulting 92x92 eigenvector matrix with your original 570000x92 matrix to get eigenvectors for your original data table." What would be the interpretation or use of this new matrix? Thanks!

ADD REPLY
0
Entering edit mode

It was the size of the dataframe the PCA was trying to make that was the problem, not the CPU expense. Anyways, I started doing Plink MDS plots and I think they are working.

ADD REPLY
0
Entering edit mode
9.3 years ago

Is it really necessary to use all the 570,000 SNPs at once? If you just want to see how your 92 arrays group on a PCA plot you could randomly sample a bunch of maybe 50,000 SNPs and do PCA on that. Repeat with different random samples to see how consistent they are.

ADD COMMENT

Login before adding your answer.

Traffic: 1349 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6