Question

PCA for BIG DATA

23 months ago

Giulia.cosenza ▴ 100

Hi, I was wondering if there is an R package for PCA for big data. I'm working with a data frame with more than 80000 variables.

Thank you!

R PCA • 1.7k views

What question are you trying to address with PCA, and why is having so many variables a problem?

ADD REPLY • link 23 months ago by Jean-Karim Heriche 27k

I personally use FactoMineR, but whatever the number of variables you have, prcomp() should work. The more variables you have, the longer it will take, but I do not see any other drawback.

ADD REPLY • link 23 months ago by Basti ★ 2.0k

I attach the model I use to perform the PCA and the error I get.

[Attached screenshot not available]

ADD REPLY • link 23 months ago by Giulia.cosenza ▴ 100

Should it be installed outside R?

ADD REPLY • link 23 months ago by Giulia.cosenza ▴ 100

score 2 · Answer 1 · 2022-05-19

23 months ago

4galaxy77 2.8k

I typically use irlba::prcomp_irlba to compute truncated principal components of large matrices.

https://www.rdocumentation.org/packages/irlba/versions/2.3.5/topics/prcomp_irlba

score 2 · Answer 2 · 2022-05-19

As far as I know, none of the PCA implementations care about the number of variables. It will take longer and require more memory to calculate with 80000 than with 80 variables, but PCA is still one of the fastest dimensionality reduction techniques. It sounds like you are having a memory problem.

I just created a random dataset with 10000 points and 80000 features. Generating it took about 25 minutes. Calculating the first 50 PCs took 48 minutes in total, most of which (44 minutes) was spent on data loading and normalization.
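For a rough sense of scale, here is a back-of-the-envelope estimate of the memory a matrix of that size needs. It is only a sketch: it assumes dense storage at 8 bytes per value (double precision), and the function name `dense_matrix_gib` is made up for illustration.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Size in GiB of a dense numeric matrix.

    Assumption: every value is stored as an 8-byte double and nothing is sparse.
    """
    return n_rows * n_cols * bytes_per_value / 2**30


# 10000 points x 80000 features, as in the test dataset described above
print(f"{dense_matrix_gib(10_000, 80_000):.2f} GiB")  # prints "5.96 GiB"
```

Centering and scaling usually create at least one extra copy of the data, so the working set during PCA can be a few times larger than this raw figure.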

So it can definitely be done, assuming a computer with reasonable memory (I'd say 32-64 GB, depending on the number of data points). I know you didn't ask for a Python implementation, but in case the R packages don't pan out:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Finally, as already suggested, you may want to consider a truncated SVD (tSVD) to reduce the dataset before plugging it into PCA, although some PCA implementations already use tSVD (not the fastest approach). It is very likely that a majority of your features are not informative, and tSVD will make the dataset more manageable for PCA and potentially other downstream applications.

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
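To illustrate the idea behind truncated methods, which compute only the leading components instead of a full decomposition, here is a toy pure-Python sketch that estimates the first principal axis by power iteration. This is not how irlba or scikit-learn do it internally (they rely on more sophisticated algorithms); the function names and the toy data are made up for illustration.

```python
import math
import random


def center_columns(rows):
    """Subtract each column's mean, since PCA works on centered data."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in rows]


def top_component(rows, n_iter=200, seed=0):
    """Estimate the first principal axis by power iteration.

    The p x p matrix X^T X is never formed: each step computes X^T (X v),
    which is what keeps this kind of approach usable when p is large.
    """
    x = center_columns(rows)
    n, p = len(x), len(x[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]  # random starting direction
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]           # X v
        w = [sum(scores[i] * x[i][j] for i in range(n)) for j in range(p)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]  # renormalise to unit length
    return v


# Toy data: points scattered closely around the line y = 2x
toy = [[float(t), 2.0 * t + (0.1 if t % 2 == 0 else -0.1)] for t in range(-5, 6)]
print(top_component(toy))
```

On the toy data the returned unit vector should point along the line y = 2x (up to sign). Further components could be obtained by removing the found direction from the data and repeating, which is roughly what "first 50 PCs" means in practice.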

score 1 · Answer 3 · 2022-05-19

My PCAtools package is fine for 'big data', thanks to implementations by Aaron Lun. In it, PCA is actually performed via BiocSingular::runPCA(), which means that it also supports parallelised computation.

https://bioconductor.org/packages/release/bioc/html/PCAtools.html

Kevin