Question: Principal Component Analysis: How many most variable genes should I take?
tlorin wrote, 2.7 years ago:

I want to plot my data (a time-series dataset) using PCA, and I am wondering how many of the "most variable" genes I should take. For instance, in DESeq2, the default for plotPCA is ntop=500, but why not 1,000, or simply all genes?

I saw on this page that

It's a tradeoff between computational efficiency and getting the most information out of your data. (...) In most situations I wouldn't expect much difference between a PCA plot of the top 500 genes and all of them.

However, in my case, it does change the appearance of the PCA plot and the relative contributions of the axes. For instance,

  • for ntop=500, PC1=62% and PC2=7%
  • for ntop=1,000, PC1=54% and PC2=10%
  • for ntop=10,000, PC1=31% and PC2=18%

(I'm sorry, I cannot upload the actual graphs).
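For what it's worth, the dilution effect can be reproduced generically. Below is a minimal sketch in Python with numpy (synthetic data, not my actual dataset or DESeq2's code): a block of high-variance genes shares one strong trend, and PC1's share of the variance shrinks as more low-variance genes are included.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expression" matrix: 2,000 genes x 21 samples.
# A block of 100 genes carries a strong shared trend; the rest is noise.
n_genes, n_samples = 2000, 21
trend = np.linspace(-1.0, 1.0, n_samples)
data = rng.normal(0.0, 1.0, size=(n_genes, n_samples))
data[:100] += 8.0 * trend

def pc1_fraction(mat, ntop):
    """PCA on the ntop most variable rows; return PC1's share of the variance."""
    order = np.argsort(mat.var(axis=1))[::-1]
    top = mat[order[:ntop]]
    centered = top - top.mean(axis=1, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    return s[0] ** 2 / (s ** 2).sum()

for ntop in (100, 500, 2000):
    print(ntop, round(pc1_fraction(data, ntop), 2))
# PC1's share drops as more low-variance (noise) genes are included.
```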

Which one should I "trust" more? Should I take all genes, or rather a subset of the most variable ones, and if so, how many? Many thanks!

rna-seq deseq2 • 3.2k views

If you sum PC1 and PC2, ntop=500 gives 69%, while the others are below that. I would take 500, because it explains your results better overall.

Reply by vmicrobio, 2.7 years ago
Kevin Blighe (University College London) wrote, 2.7 years ago:

There is no standard cut-off. Be aware also that the plotPCA() function of DESeq2 already removes low-variance genes automatically whenever you use it (which is why I don't use it and instead have my own PCA code).

In addition, the answer depends on your end goal, and also on what your PC1 is actually segregating: is it disease versus control? If you are trying to model some condition that is being segregated by PC1, why not take the top 500 genes and then 'throw' them into a lasso-penalised regression to see which ones actually best predict that condition of interest?
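To illustrate the lasso idea on synthetic data (in R one would likely reach for glmnet; here is a self-contained Python/numpy sketch using proximal gradient descent, with all data, gene counts, and penalty values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_genes = 21, 500
time = np.linspace(0.0, 1.0, n_samples)

# Expression for the "top 500" genes; only the first 5 actually track time.
X = rng.normal(size=(n_samples, n_genes))
X[:, :5] += 6.0 * time[:, None]
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise each gene
y = time - time.mean()

def lasso_ista(X, y, lam=0.1, iters=2000):
    """Minimise 0.5/n * ||y - Xb||^2 + lam * ||b||_1 by proximal gradient."""
    n = len(y)
    step = n / np.linalg.norm(X, 2) ** 2            # 1 / Lipschitz constant
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = b - step * (X.T @ (X @ b - y)) / n      # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # soft-threshold
    return b

beta = lasso_ista(X, y)
selected = np.flatnonzero(beta)
print(selected.size)   # a sparse subset, dominated by the signal genes (0-4)
```

The L1 penalty zeroes out most coefficients, so the surviving genes are the ones that jointly predict the covariate, rather than merely the most variable ones.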

If this is a discovery ('training') dataset and you are just aiming to filter out genes based on variance or PCA, without even using them in the differential expression analysis, then you can possibly justify that; however, you will be criticised when you try to publish (though perhaps not rejected). It is more common to filter out low-variance genes when you are actually validating a panel derived from a discovery set. For example, if one had a panel of 200 genes from the discovery phase and was validating / testing these using NanoString or Fluidigm in the same or an independent cohort, then you could remove 50% of the dataset if its variance was low.

There is no right or wrong answer here, though.


@Kevin Blighe Thanks for your answer! My data is a time-series dataset with 21 sample points, so not disease vs control (I don't know if that changes your point?). I have added this detail to the post above.

Reply by tlorin, 2.7 years ago

Hey, thanks for elaborating - it does and does not change my perspective. I think that it still depends on what is being segregated by your PC1: if it is clearly segregating one time point from another, then you can justify taking a look at the PC1 loadings and removing any genes that are not responsible for the variation (the remainder will then be the genes responsible for differences between the time points).

What if you had the following scenario, though: you look at eigenvector 10 (PC10) and it is responsible for just 5% of the variation in your ~20,000-gene dataset, but it segregates one time point from another extremely well. If you throw away information from your data at the outset, in a context where PC1 and PC2 capture a reasonably high level of variation of some other kind, some of the genes responsible for the variation at this low level may be thrown away as well.
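That scenario is easy to simulate. A hedged Python/numpy sketch (entirely synthetic data: a dominant structure unrelated to time sits on PC1, while a weaker time-associated signal lands on a later, lower-variance PC):

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_samples = 1000, 21
time = np.arange(n_samples, dtype=float)

data = rng.normal(0.0, 1.0, size=(n_genes, n_samples))
# Dominant structure unrelated to time: an alternating "batch" pattern.
batch = np.where(np.arange(n_samples) % 2 == 0, 1.0, -1.0)
data[:30] += 5.0 * batch
# A weaker signal that tracks time, carried by a different gene block.
data[30:60] += 0.4 * time

centered = data - data.mean(axis=1, keepdims=True)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()

# Which PC tracks the covariate of interest? Not necessarily PC1.
corrs = [abs(np.corrcoef(vt[i], time)[0, 1]) for i in range(10)]
best = int(np.argmax(corrs))
print(best + 1, round(explained[best], 3))   # the time-tracking PC and its share
```

Filtering genes by overall variance up front would preferentially discard the block carrying the time signal, even though it is exactly the block of interest here.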

We do want to throw away useless junk, such as genes with very low counts or a lot of missing data, but we do that through other means.
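One common example of those "other means" is an independent low-count filter applied before any analysis. A minimal sketch (Python/numpy, with arbitrary illustrative thresholds, not DESeq2's internal filtering):

```python
import numpy as np

rng = np.random.default_rng(3)

# 200 genes x 21 samples of synthetic counts; the first 40 are "junk"
# genes with near-zero expression everywhere.
counts = rng.poisson(lam=20.0, size=(200, 21))
counts[:40] = rng.poisson(lam=0.2, size=(40, 21))

# Keep genes with at least 10 reads in at least 3 samples
# (illustrative thresholds, not DESeq2 defaults).
keep = (counts >= 10).sum(axis=1) >= 3
filtered = counts[keep]
print(counts.shape[0] - int(keep.sum()), "genes removed")
```

Unlike a variance filter, this removes genes for which there is simply no usable measurement, without touching genes whose variation happens to be modest.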

For your main statistical analyses, I think that you should just leave everything in (at least initially) and make a decision afterwards about removing genes if you feel it is necessary. This will also depend on which statistical test you are aiming to use.

These days, I don't believe that the "computational efficiency" to which they refer in your quoted passage is much of an issue for this type of work.

Reply by Kevin Blighe, 2.7 years ago (modified 19 months ago)

I think that it still depends on what is being segregated by your PC1: if it is clearly segregating one time-point from another

This is indeed the case: with ntop=500, PC1 segregates the time points quite well. If I increase ntop, this "time" signal is no longer visible on PC1 alone, and it starts to get "diluted" across the other PCs.

Reply by tlorin, 2.7 years ago

Interesting! Based on that observation, I think that you have every justification to proceed along that line. As I implied, there is no right or wrong answer here.

Reply by Kevin Blighe, 2.7 years ago
MAPK wrote, 2.7 years ago:

Try the SNPRelate R package: you can select the top genes and top samples contributing the most variance in a PCA.

Powered by Biostar version 2.3.0