Question: Principal Component Analysis: How many most variable genes should I take?
tlorin wrote, 8 months ago:

I want to plot my data (timeseries dataset) using PCA and am wondering how many "most variable" genes I should take. For instance, in DESeq2, the default for plotPCA is ntop=500, but why not 1,000 or simply all genes?

I saw on this page that

It's a tradeoff between computational efficiency and getting the most information out of your data. (...) In most situations I wouldn't expect much difference between a PCA plot of the top 500 genes and all of them.

However, in my case, it does change the appearance of the PCA plot and the relative contributions of the axes. For instance,

  • for ntop=500, PC1=62% and PC2=7%
  • for ntop=1,000, PC1=54% and PC2=10%
  • for ntop=10,000, PC1=31% and PC2=18%

(I'm sorry, I cannot upload the actual graphs).

Which one should I "trust" more? Should I take all genes or rather a subset of the most variable ones, and if so, how many? Many thanks!

Tags: rna-seq, deseq2

If you sum PC1 and PC2, ntop=500 gives 69%, while the others are lower. I would take 500, because it explains your results better overall.

(haro, 8 months ago)
Kevin Blighe (University College London Cancer Institute) wrote, 8 months ago:

There is no standard cut-off. Be aware also that the plotPCA() function of DESeq2 already selects only the top-variance genes automatically whenever you use it (this is why I don't use it and instead have my own PCA code).
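For illustration, that selection step can be sketched in a few lines of numpy (a sketch of the idea only, not DESeq2's actual code; the function name `pca_variance_explained` and the simulated time-course data are assumptions of mine). The toy data also reproduces the questioner's observation that PC1's share of variance shrinks as ntop grows:

```python
import numpy as np

def pca_variance_explained(mat, ntop=500):
    # mat: genes x samples matrix of variance-stabilised expression values.
    # Keep the ntop most variable genes, then run PCA over samples
    # (a numpy sketch of plotPCA's selection step, not DESeq2's code).
    rv = mat.var(axis=1)                               # per-gene variance across samples
    keep = np.argsort(rv)[::-1][: min(ntop, len(rv))]  # indices of the ntop most variable
    x = mat[keep].T                                    # samples x genes, as prcomp expects
    x = x - x.mean(axis=0)                             # centre each gene
    s = np.linalg.svd(x, compute_uv=False)             # singular values
    return s ** 2 / (s ** 2).sum()                     # fraction of variance per PC

# Toy data: 300 genes tracking time plus 5,000 pure-noise genes, 21 time points.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 21)
slopes = rng.normal(6.0, 1.0, 300)
signal = np.outer(slopes, t) + 0.3 * rng.normal(size=(300, 21))
mat = np.vstack([signal, rng.normal(size=(5000, 21))])

pc1_small = pca_variance_explained(mat, ntop=300)[0]          # PC1 share, top genes only
pc1_all = pca_variance_explained(mat, ntop=mat.shape[0])[0]   # PC1 share, all genes
# As in the question: PC1 explains a larger share when ntop is small,
# because adding noise genes dilutes the shared time signal.
```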

In addition, the answer depends on what your end goal is, and also on what your PC1 is actually segregating - is it disease versus control? If you are trying to model some condition that is being segregated by PC1, why not take the top 500 and then throw them into a lasso-penalised regression to see which ones actually best predict the condition of interest (the one being segregated by PC1)?
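A minimal sketch of such a lasso-penalised regression, using a hand-rolled coordinate-descent solver rather than glmnet (the function name, penalty value and toy data are all illustrative assumptions, not anyone's actual pipeline):

```python
import numpy as np

def lasso_cd(X, y, lam=0.1, n_iter=200):
    # Lasso via cyclic coordinate descent: minimises
    #   (1/2n) * ||y - X @ beta||^2 + lam * ||beta||_1
    # A minimal sketch of the idea, not glmnet; no intercept, X assumed centred.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]       # residual excluding gene j
            rho = X[:, j] @ r / n
            z = (X[:, j] ** 2).sum() / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z  # soft-threshold
    return beta

# Toy data: 21 samples, 10 candidate genes; only genes 0 and 1 predict y.
rng = np.random.default_rng(0)
X = rng.normal(size=(21, 10))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.3 * rng.normal(size=21)
beta = lasso_cd(X, y, lam=0.1)
# The L1 penalty drives most uninformative coefficients to (near) zero,
# leaving the genes that actually predict the condition.
```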

If this is a discovery/training dataset and you are just aiming to filter out genes and not even use them in the differential expression analysis, then you can possibly justify that, but you would be criticised when you try to publish (though maybe not rejected). It is more common to filter out genes of low variance when you are actually validating a derived panel from a discovery set. For example, if one had a panel of 200 genes from a discovery test and was testing these on NanoString or Fluidigm, then you could remove the 50% of the dataset whose variance was low.

There is no right or wrong answer here.


@Kevin Blighe Thanks for your answer! My data is a time-series dataset with 21 sample points, so not disease vs control (I don't know if that changes your point?). I added this detail to the post above.

(tlorin, 8 months ago)

Hey, thanks for elaborating - it does and does not change my perspective. I think that it still depends on what is being segregated by your PC1: if it is clearly segregating one time point from another, then you can justify taking a look at the PC1 loadings and removing any genes that are not responsible for the variation (the remainder will then be the genes responsible for differences between the time points).
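Pulling out the genes that contribute most to a given PC can be sketched in numpy like this (the loadings are what prcomp returns as $rotation; the function name and toy data below are illustrative assumptions):

```python
import numpy as np

def top_loading_genes(mat, pc=0, n=20):
    # Rank genes by their absolute loading on one principal component.
    # mat: genes x samples. The loadings (prcomp's $rotation) say how much
    # each gene contributes to that PC. A numpy sketch, not an R pipeline.
    x = (mat - mat.mean(axis=1, keepdims=True)).T   # samples x genes, centred per gene
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    loadings = vt[pc]                               # one loading per gene for this PC
    order = np.argsort(np.abs(loadings))[::-1]      # largest absolute loading first
    return order[:n]

# Toy data: genes 0-49 follow the time course, genes 50-499 are noise.
rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 21)
signal = np.outer(rng.normal(8.0, 1.0, 50), t) + 0.3 * rng.normal(size=(50, 21))
mat = np.vstack([signal, rng.normal(size=(450, 21))])

top = top_loading_genes(mat, pc=0, n=20)
# The genes with the largest PC1 loadings are the ones driving the time signal.
```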

What if you had the following scenario, though: you look at eigenvector 10 (PC10) and it is responsible for just 5% of the variation in your ~20,000-gene dataset, but it segregates one time point from another extremely well. If you throw away information from your data up front, in a setting where PC1 and PC2 contribute a reasonably high level of variation of some other kind, some of the genes responsible for variation at this low level may be thrown away as well.

We do want to throw away useless junk, such as genes with very low counts or a lot of missing data, but we do that through other means.

For your main statistical analyses, I think that you should just leave everything in (at least initially) and then, afterwards, make a decision about removing genes if you feel that it is necessary. This will also depend on which statistical test you are aiming to use.

These days, I don't believe that the "computational efficiency" to which they refer in your quoted passage is much of an issue for this type of work.

(Kevin Blighe, 8 months ago, modified)

I think that it still depends on what is being segregated by your PC1: if it is clearly segregating one time-point from another

This is indeed the case: with ntop=500, PC1 segregates the time points quite well. If I increase ntop, this "time" signal is no longer visible on PC1 alone and starts to get "diluted" across other PCs.

(tlorin, 8 months ago)

Interesting! Based on that observation, I think that you have every justification to proceed along that line. Like I implied, there's no right or wrong answer here.

(Kevin Blighe, 8 months ago)
MAPK (United States) wrote, 8 months ago:

Try the SNPRelate R package: you can select the top genes and the top samples contributing the most variance in a PCA.



Powered by Biostar version 2.3.0