I want to plot my data (a time-series dataset) using PCA, and I am wondering how many of the "most variable" genes I should use. For instance, in DESeq2 the default is ntop=500, but why not 1,000, or simply all genes?
I saw on this page that:
"It's a tradeoff between computational efficiency and getting the most information out of your data. (...) In most situations I wouldn't expect much difference between a PCA plot of the top 500 genes and all of them."
However, in my case, it does change the appearance of the PCA plot and the relative contribution of the axes. For instance:
ntop=500, PC1=62% and PC2=7%
ntop=1,000, PC1=54% and PC2=10%
ntop=10,000, PC1=31% and PC2=18%
(I'm sorry, I cannot upload the actual graphs).
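In case it helps to make the question concrete, here is a minimal sketch of how I understand the percentages to be computed: rank genes by their variance across samples, keep the top ntop, run PCA on that subset, and report each component's share of the subset's total variance. It is written in Python with NumPy rather than R, it runs on synthetic data (not my real dataset), and the function name pca_percent_var and the toy matrix are my own illustrations, not part of DESeq2.

```python
import numpy as np

def pca_percent_var(expr, ntop):
    """Percent variance per PC using only the ntop most variable genes.

    expr: genes x samples matrix of already-transformed values
          (assumption: something like variance-stabilized counts).
    """
    expr = np.asarray(expr, dtype=float)
    # Per-gene variance across samples (sample variance, ddof=1).
    row_var = expr.var(axis=1, ddof=1)
    n_keep = min(ntop, expr.shape[0])
    # Indices of the n_keep genes with the largest variance.
    top = np.argsort(row_var)[::-1][:n_keep]
    # PCA on samples: transpose to samples x genes, center each gene.
    sub = expr[top].T
    centered = sub - sub.mean(axis=0)
    # Squared singular values are proportional to the PC variances.
    s = np.linalg.svd(centered, compute_uv=False)
    pc_var = s ** 2
    return 100.0 * pc_var / pc_var.sum()

# Synthetic example: 2000 genes, 12 samples, 100 genes follow a time trend.
rng = np.random.default_rng(0)
n_genes, n_samples = 2000, 12
time = np.linspace(-1.0, 1.0, n_samples)
expr = rng.normal(size=(n_genes, n_samples))
expr[:100] += 4.0 * time  # hypothetical time-dependent genes

for ntop in (100, 500, 2000):
    pv = pca_percent_var(expr, ntop)
    print(f"ntop={ntop}: PC1={pv[0]:.1f}%, PC2={pv[1]:.1f}%")
```

If this is right, the percentages are always relative to the total variance of the selected genes only, so the denominator changes with ntop and the values for different ntop are not directly comparable.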
Which one should I "trust" more? Should I use all genes or only a subset of the most variable ones, and if the latter, how many? Many thanks!