Question: Is feature selection necessary before dimensional reduction in single-cell analysis?
Phoe wrote (14 months ago):

Hi all,

What do you think about the feature selection step in single-cell RNA analysis? I am familiar with Seurat and the 10X platform (Cell Ranger). I've noticed that Seurat's default setting is to use 2,000 HVGs (highly variable genes) for dimensional reduction (PCA/t-SNE/UMAP), whereas Cell Ranger's default is to use all features (genes) instead.

In terms of computing efficiency, feature selection reduces dimensionality and thus speeds up the calculations, but it also carries the prior assumption that the retained variation reflects biological differences between cells rather than technical noise (see 6.3.1.1 Highly Variable Genes). Personally, I do see the advantages of this feature-selection approach, which is also a basic concept in machine learning.
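To make the idea concrete, here is a minimal sketch in Python with NumPy of keeping only the top-variance genes before dimensional reduction. This is only the selection idea, not Seurat's actual method (Seurat's FindVariableFeatures uses a variance-stabilizing transformation rather than raw variance), and the matrix and sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy expression matrix: 300 cells x 1000 genes (log-normalized in practice).
X = rng.poisson(1.0, size=(300, 1000)).astype(float)

n_top = 200  # analogous in spirit to Seurat's default of 2000 HVGs
gene_var = X.var(axis=0)                      # per-gene variance across cells
hvg_idx = np.argsort(gene_var)[::-1][:n_top]  # indices of top-variance genes
X_hvg = X[:, hvg_idx]                         # reduced matrix for PCA/t-SNE/UMAP
```

Downstream steps (PCA, clustering, embedding) would then run on `X_hvg` instead of the full matrix.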

However, there are several questions that haunted me for a while:

(1) How many features should we choose? How do you test this?

(2) What if the clustering results based on all features and on 2,000 HVGs are very different? I once saw a dataset with two samples (WT/MUT) where, using 2,000 HVGs, there was no apparent "batch effect" (cells from the two samples were well mixed, with no correlation between samples), while using all genes there was a strong "batch effect" (cells clustered obviously by WT/MUT). This could definitely affect the decision of whether or not to do batch correction (Seurat integration, MNN, etc.).
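One way to quantify how much the clustering changes between two feature sets is to compare the resulting cluster labels with the adjusted Rand index. This is a hypothetical workflow of my own, not something from the thread, using scikit-learn on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))  # toy cells x genes matrix

# "HVG" set: top 100 genes by variance (simplified stand-in for Seurat's HVGs).
hvg_idx = np.argsort(X.var(axis=0))[::-1][:100]

def cluster(mat, k=3):
    # PCA to 10 components, then k-means, mimicking a standard scRNA-seq pipeline.
    pcs = PCA(n_components=10, random_state=0).fit_transform(mat)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pcs)

ari = adjusted_rand_score(cluster(X), cluster(X[:, hvg_idx]))
print(f"ARI between all-gene and HVG clusterings: {ari:.2f}")
```

An ARI near 1 means the two feature sets give essentially the same partition; values near 0 mean they disagree, which is exactly when the batch-correction decision becomes tricky.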

Any thoughts, opinions, or suggestions would be greatly appreciated.

Thank you!

modified 14 months ago • written 14 months ago by Phoe

In my opinion: 1) feature selection is not necessary; 2) you should always validate your clustering with known markers to check whether the clustering is reasonable, and adjust your parameters accordingly; 3) PCA is just a linear transformation; it won't give you accurate results for a complicated dataset.

written 14 months ago by shoujun.gu

Hi, thanks for the suggestions!

written 14 months ago by Phoe
Mensur Dlakic (USA) wrote (14 months ago):

A simple response is NO.

However, depending on the number of starting features and the exact method used for dimensionality reduction, it may be helpful to select features beforehand. Some of the arguments I will briefly present here have already been made in this post.

Dimensionality reduction methods differ both in how they handle the number of features and in the linear/non-linear nature of the new features. For example, PCA is reproducible (deterministic) and very fast, even for a large number of features. Once you train a PCA model, it can be applied to new data in the same format as the original dataset. t-SNE is non-reproducible (at least not perfectly reproducible) and relatively slow with a large number of features (anything over 30-50) or samples (>50,000), but it produces a more informative embedding and more intuitive clusters when there are non-linear relationships between features. UMAP scales well to large numbers of features and samples, and can capture non-linear relationships between features.
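The reusability point can be illustrated with scikit-learn (my choice of library for the sketch; the thread names none): a fitted PCA model can be saved and applied to new data with the same feature layout, which t-SNE cannot do:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 50))

pca = PCA(n_components=10, random_state=0).fit(X_train)  # deterministic fit

# New samples in the same format can be projected with the saved model;
# t-SNE in scikit-learn is non-parametric and has no equivalent transform step.
X_new = rng.normal(size=(20, 50))
embedding = pca.transform(X_new)
```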

A more elaborate answer to your original question: 1) PCA is fast and does not require feature selection, but it will not produce informative plots in some complex cases; 2) as a pre-processing step for t-SNE, it is advisable to run PCA on the data and keep 30-50 principal components if the number of original features is >>50; t-SNE will still be relatively slow, and it is non-parametric (models can't be saved and applied to new datasets); 3) UMAP is slower than PCA but much faster than t-SNE, and it works well for large datasets; its models can also be saved. Even more to the point: PCA is the fastest but does not always give a clear cluster structure; t-SNE is the slowest but often gives the most visually pleasing results; UMAP is somewhere between the two in terms of both speed and visualization.
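Point 2) can be sketched as a PCA-then-t-SNE pipeline in scikit-learn (again an assumed library and toy sizes, not a prescription from the thread):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # far more original features than 50

# Reduce to 50 principal components first, then run t-SNE on the compact
# representation, which is much cheaper than t-SNE on 2000 raw features.
pcs = PCA(n_components=50, random_state=0).fit_transform(X)
emb = TSNE(n_components=2, perplexity=30.0, init="pca",
           random_state=0).fit_transform(pcs)
```

Note that `perplexity` must be smaller than the number of samples, and `random_state` only makes a single run repeatable; it does not make t-SNE deterministic across implementations.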

written 14 months ago by Mensur Dlakic

Thank you. I think 2) is worth noting, but since we are using single-cell data, the number of features (genes, here) is frequently >>50, right? Also, in your experience, if data points (cells) computed from the same PCs overlap across different clusters in the t-SNE plot (e.g., cluster 1 and cluster 2 are mixed together) but are clearly separated in the UMAP, how would you explain this? What else would you check? Does it mean the t-SNE plot can't capture much of this complicated data?

written 14 months ago by Phoe
igor (United States) wrote (14 months ago):

I've noticed that the default setting of Seurat is using 2000 HVGs (Highly Variable Genes) for dimensional reduction (PCA/tSNE/UMAP) but the default setting of Cell Ranger is using the whole features(genes) instead.

Since you mentioned Seurat specifically, according to the Seurat developers (GitHub):

we typically do not notice large differences in the analysis depending on the exact number of genes selected- ranging from 2k genes to even the full transcriptome

written 14 months ago by igor

Thanks! I've also seen this, though that user was asking about Seurat's integration method.

Indeed, the number of features used does affect the output of all single-cell analyses (including clustering, integration, pseudotime, etc.). Unfortunately we can't advise on the exact value to choose, but agree that the sensitivity of some analyses to this parameter can be frustrating. Our best suggestion is to use the SCTransform workflow, which weights genes in downstream analysis based on their biological variation. As a result, adding more genes into the analysis makes less of a difference, because they have lower weights, and we find that the results exhibit less sensitivity to the number of features included.

written 14 months ago by Phoe

In that same issue, although the results are somewhat different with different numbers of genes, it's not clear which version is better. It's possible that both representations are equally inaccurate.

modified 14 months ago • written 14 months ago by igor

Found another description similar to what Igor provided.

Generally, we find that 2-3K genes tend to work well for most datasets that we analyze (and that's what we use in all vignettes).

https://github.com/satijalab/seurat/issues/1989

modified 14 months ago • written 14 months ago by Phoe
Phoe wrote (14 months ago):

Hi all, I found this review very informative; it mentions some critical points about unsupervised clustering.

modified 14 months ago • written 14 months ago by Phoe
Powered by Biostar version 2.3.0