Question: Need a good way to run a Principal Component Analysis on SNP data
devenvyas wrote:

I am in a bit of a totally-lost situation: I get one thing right, but then two things go wrong (sorry if some of these comments repeat material from previous threads). I need some help/advice on how to proceed.

Just for context, this dataset is meant for two things (for now): 1) to see how these samples relate to various global populations using analyses such as PCA and STRUCTURE (or similar), and 2) to estimate archaic introgression (i.e., from Neanderthals and Denisovans).

I have 92 samples that were genotyped on the Affy Human Origins array at about 570,000 SNPs. I also have comparative data from 934 HGDP samples that I have downloaded, but there is so much of it that I need to cull it down (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.map.gz, ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.ped.gz).

For the time being, I've been trying to run a PCA on just my own samples (n = 92), but this isn't working either.

I coded my genotypes as 0/1/2 using JMP Genomics and then ran the following in R on UF's computer cluster:

# genotype table: rows = loci, columns = samples
table <- read.table('92simpleb_recgeno.txt', sep = '\t', header = TRUE, row.names = 1)
# plot colours for the two groups of samples (3 + 89 = 92)
pcol <- c(rep("green", 3), rep("blue", 89))
# recode missing genotypes ("." in the export) as NA
table[table == "."] <- NA
# transpose so that rows = samples and columns = loci
trans <- t(table)
pca <- prcomp(~ ., data = trans, na.action = na.omit)
save(pca, file = "1.RData")

and I got this back

Error: cannot allocate vector of size 1284.9 Gb
Execution halted

Basically, I am lost (and a bit frustrated), and I need some suggestions on how to run a PCA on this much data without metaphorically blowing something up. Currently, the data are formatted with columns as samples and rows as loci, but I can re-export them from the Affy Genotyping Console in a different format. Any suggestions on what I should do?

Tags: snp pca

R cannot handle a matrix of that size. If you are working with SNP data, maybe you can try EIGENSTRAT from the EIGENSOFT package.
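For what it's worth, a rough, untested sketch of what a smartpca (EIGENSTRAT) run could look like, driven from R since that is what the rest of the thread uses; "mydata" is a placeholder prefix and the genotypes would first have to be in EIGENSTRAT format:

# Hypothetical smartpca run; assumes EIGENSOFT's smartpca binary is on the PATH
# and that mydata.geno / mydata.snp / mydata.ind already exist.
par_lines <- c(
  "genotypename: mydata.geno",
  "snpname:      mydata.snp",
  "indivname:    mydata.ind",
  "evecoutname:  mydata.evec",  # per-sample principal component coordinates
  "evaloutname:  mydata.eval",  # eigenvalues
  "numoutevec:   10"            # number of PCs to write out
)
writeLines(par_lines, "smartpca.par")
system2("smartpca", args = c("-p", "smartpca.par"))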

written 4.4 years ago by Sam

You could also use PLINK: https://www.cog-genomics.org/plink2/strat#pca. PLINK's PCA only uses a subset of samples, making the computation feasible and (somehow) fast.
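For reference, a rough, untested sketch of that kind of run, called from R since that is what the rest of the thread uses; "hgdp_all" is a placeholder for your own binary fileset, and PLINK 1.9 is assumed to be on your PATH:

# Hypothetical PLINK 1.9 PCA on a binary fileset (hgdp_all.bed/.bim/.fam).
system2("plink", args = c("--bfile", "hgdp_all",
                          "--pca", "10",           # report the top 10 PCs
                          "--out", "hgdp_pca"))    # writes hgdp_pca.eigenvec / .eigenval

# The .eigenvec output (FID, IID, PC1, PC2, ...) can then be plotted back in R:
evec <- read.table("hgdp_pca.eigenvec", header = FALSE)
plot(evec$V3, evec$V4, xlab = "PC1", ylab = "PC2")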

written 4.3 years ago by alesssia

I am about to try an MDS in PLINK 1.07. Do you happen to know whether the application supports multiple threads/processors?

written 4.3 years ago by devenvyas

I'm not sure whether PLINK 1.07 supports multithreading, but I would say no. Why don't you use PLINK 1.9? It supports threads and is more optimised in general.

For PLINK 1.9: https://www.cog-genomics.org/plink2/, for threads: https://www.cog-genomics.org/plink2/other#threads
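As a small, untested illustration (same placeholder fileset as elsewhere in the thread), the thread count is just one extra flag in PLINK 1.9:

# Hypothetical: restrict the PLINK 1.9 PCA run to 4 threads with --threads.
system2("plink", args = c("--bfile", "hgdp_all", "--pca", "10",
                          "--threads", "4", "--out", "hgdp_pca"))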

written 4.3 years ago by alesssia

Currently, UF's cluster does not have PLINK 1.9 installed (just 1.07). I am going to ask them to add it, but that will take a day or two, so I am using 1.07 for now.

written 4.3 years ago by devenvyas

I think I am going to take alesssia's advice and finally give PLINK a try for now, as my gigantic comparative dataset is already in PLINK format, but I had a question about EIGENSTRAT. Eventually, I may need the EIGENSTRAT file format for ADMIXTOOLS. Does the package have a method of converting PLINK files to EIGENSTRAT?

written 4.3 years ago by devenvyas

See here: Conversion tools for GWAS
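For what it's worth, the usual route is EIGENSOFT's convertf; below is a rough, untested sketch of a PED-to-EIGENSTRAT conversion driven from R, with "mydata" as a placeholder prefix (check the CONVERTF README for the exact options in your version):

# Hypothetical convertf run: PLINK .ped/.map to EIGENSTRAT format.
# Assumes EIGENSOFT's convertf binary is on the PATH.
par_lines <- c(
  "genotypename:    mydata.ped",
  "snpname:         mydata.map",
  "indivname:       mydata.ped",
  "outputformat:    EIGENSTRAT",
  "genotypeoutname: mydata.geno",
  "snpoutname:      mydata.snp",
  "indivoutname:    mydata.ind"
)
writeLines(par_lines, "par.PED.EIGENSTRAT")
system2("convertf", args = c("-p", "par.PED.EIGENSTRAT"))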

written 4.3 years ago by alesssia
jamesxli2007 wrote:

It looks like your R program is trying to build a 570,000 x 570,000 covariance matrix, which is huge. A better way to do PCA on such a wide matrix is to pass the transposed, 570,000 x 92 matrix (samples as columns) to the PCA algorithm, which will then calculate the eigenvectors from the 92 x 92 covariance matrix. You can then multiply the resulting 92 x 92 eigenvector matrix with your original 570,000 x 92 matrix to get eigenvectors for your original data table.
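A rough sketch of that trick in R (not tested on real data; "geno" below is a small random stand-in for the 570,000 x 92 genotype matrix, with SNPs as rows and samples as columns, and missing values are assumed to have been dealt with already):

# Toy stand-in for the real 570,000 x 92 genotype matrix (rows = SNPs, columns = samples).
geno <- matrix(sample(0:2, 5000 * 92, replace = TRUE), nrow = 5000)

centered <- geno - rowMeans(geno)        # centre each SNP across the 92 samples
small    <- crossprod(centered)          # 92 x 92 instead of 570,000 x 570,000
eig      <- eigen(small)

sample_pcs  <- eig$vectors               # per-sample coordinates (up to scaling)
snp_vectors <- centered %*% eig$vectors  # eigenvectors mapped back to SNP space

plot(sample_pcs[, 1], sample_pcs[, 2], xlab = "PC1", ylab = "PC2")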


Well spotted! If you want to see whether the 92 samples tend to group into clusters, you can do an xy-plot of the first two principal components obtained from the matrix t(570,000 x 92), which is not too computationally expensive, right? Then you say, "You can then multiply the resulting 92 x 92 eigenvector matrix with your original 570,000 x 92 matrix to get eigenvectors for your original data table." What would be the interpretation or use of this new matrix? Thanks!

written 4.3 years ago by dariober

It was the size of the data frame the PCA was trying to build that was the problem, not the CPU expense. Anyway, I started making PLINK MDS plots and I think they are working.

written 4.3 years ago by devenvyas
dariober wrote:

Is it really necessary to use all 570,000 SNPs at once? If you just want to see how your 92 samples group on a PCA plot, you could randomly sample a bunch of them, maybe 50,000 SNPs, and do the PCA on that. Repeat with different random samples to see how consistent the results are.
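A rough sketch of what that could look like in R; "geno" below is a random toy stand-in for the real 570,000 x 92 matrix of 0/1/2 genotypes (rows = SNPs, columns = samples), with missing values assumed to be handled already:

# Toy genotype matrix so the snippet runs on its own (replace with your real data).
geno <- matrix(sample(0:2, 60000 * 92, replace = TRUE), nrow = 60000)

set.seed(1)
par(mfrow = c(1, 3))
for (i in 1:3) {
  keep <- sample(nrow(geno), 50000)   # draw 50,000 random SNPs
  pca  <- prcomp(t(geno[keep, ]))     # samples as rows, SNPs as columns
  plot(pca$x[, 1], pca$x[, 2],
       main = paste("Random SNP subset", i), xlab = "PC1", ylab = "PC2")
}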
