Principal Component Analysis: How many most variable genes should I take?
2
4
Entering edit mode
4.3 years ago
tlorin ▴ 340

I want to plot my data (timeseries dataset) using PCA and am wondering how many "most variable" genes I should take. For instance, in DESeq2, the default for plotPCA is ntop=500, but why not 1,000 or simply all genes?

It's a tradeoff between computational efficiency and getting the most information out of your data. (...) In most situations I wouldn't expect much difference between a PCA plot of the top 500 genes and all of them.

However, in my case, it does change the aspect of the PCA and the relative contribution of the axes. For instance,

• for ntop=500, PC1=62% and PC2=7%
• for ntop=1,000, PC1=54% and PC2=10%
• for ntop=10,000, PC1=31% and PC2=18%

(I'm sorry, I cannot upload the actual graphs).

Which one should I "trust" more? Should I take all genes or rather a subset of the most variants, and if yes, how many? Many thanks!

RNA-Seq DESeq2 • 7.6k views
0
Entering edit mode

If you sum PC1 and PC2, ntop =500 obtain 69, the other are below. I would take 500, because it explains globally better your results

6
Entering edit mode
4.3 years ago

There is no standard cut-off. Be aware also that the plotPCA() function of DESeq2 already automatically removes low variance genes whenever you use it.

In addition, your question depends on what your end goal is, and also on what your PC1 is actually segregating - is it disease versus control? If you are trying to model some condition that is being segregated by PC1, why not take the top 500 and then 'throw' them into a lasso-penalized regression to see which ones actually best predict the condition of interest (that's being segregated by PC1).

If this is a discovery ('training') dataset and you are just aiming to filter out genes based on variance for PCA and not even use them in differential expression analysis, then you can possibly justify that; however, you will be criticised when you try to publish (but maybe not rejected depending on who reviews the work). It is more common to filter out genes of low variance when you are actually validating a panel derived from a discovery set. For example, if one had a panel of 200 genes from the discovery phase and was validating / testing these using NanoString or Fluidigm in the same or an independent cohort, then you could remove 50% of the dataset if its variance was low. In these validation studies, however, we invariably do not end up using programs like DESeq2, EdgeR, etc., and instead run other independent univariate or multivariate regression models.

There is no right or wrong answer here, though.

0
Entering edit mode

@Kevin Blighe Thanks for your answer! My data is a timeseries dataset with 21 sample points, so not disease vs control (I don't know if that would change your point?). I added this precision in the post above.

1
Entering edit mode

Hey, thanks for elaborating - it does and does not change my perspective. I think that it still depends on what is being segregated by your PC1: if it is clearly segregating one time-point from another, then you can justify taking a look at the PC1 loadings and removing any genes that are not responsible for the variation (then, the remainder will be the genes responsible for differences between the time-points).

What if you had the following scenario, though: you look at eigenvector 10 (PC10) and it is responsible for just 5% variation in your ~20,000 gene dataset but segregates extremely well one timepoint from another. If you initially throw away information from your data, and in the context where a PC1 and 2 contribute to a reasonably high level of variation of some other form, some of the genes responsible for variation at this low level may additionally be thrown away.

We do want to throw away useless junk, such as genes with very low counts or a lot of missing data, but we do that through other means.

For your main statistical analyses, I think that you should just leave everything in (at least initially) and then, afterwards, make a decision about removing genes if you feel that it is necessary. This will also depend on which statistical test you are aiming to use.

These days, I don't believe that the "computational efficiency" to which they refere in your quoted passage is much of an issue for this type of work.

0
Entering edit mode

I think that it still depends on what is being segregated by your PC1: if it is clearly segregating one time-point from another

This is indeed the case: with ntop=500 PC1 segregates the time points quite well. If I increase ntop, this "time" signal is no longer visible on the sole PC1 and it starts to get "diluted" on other PCs.

1
Entering edit mode

Interesting! Based on that observation, I think that you have every justification to proceed along that line. Like I implied, there's no right or wrong answer here.

1
Entering edit mode
4.3 years ago
MAPK ★ 1.9k

Try snprelate r package:http://bioconductor.org/packages/release/bioc/html/SNPRelate.html You can select top genes and top samples contributing to most variance in a pca.