Forum: Feature Selection in 10X scRNA-seq data
ATpoint wrote:

I am looking for opinions (based on hands-on experience) on your favourite feature selection method for 10x scRNA-seq data. The motivation is that I recently stumbled over the GLM-PCA approach from Rafael Irizarry's lab (links at the bottom of the post), which made me dive into the literature. As expected, there are plenty of methods out there, each claiming superior performance. Since GLM-PCA operates on raw counts, it frees the user from choosing one of the many normalization strategies, such as those implemented in scran or SCnorm or the options provided by Seurat, and is therefore attractive. This is admittedly not a precise question (hence a Forum post), and I hope to initiate some discussion here about your current best practices that users inexperienced in the single-cell world (including myself) can take inspiration from.

Edit: As suggested below one might have a look at the scry package https://bioconductor.org/packages/release/bioc/vignettes/scry/inst/doc/scry.html which makes use of glmpca for feature selection and dimensionality reduction.

Edit (09/20): Just to comment on how it ended up: I did not use the GLMPCA/scry package in the end, as in its current state it was unusably slow on normal-sized datasets (5k cells) and regularly caused errors related to poor model fits, similar to https://github.com/kstreet13/scry/issues/15. Taken together, that made me abandon it. The concept is definitely interesting, and I hope the package soon reaches a stable state in which it can be used productively and with a reasonable runtime.

As an alternative, I ended up using the GitHub version of sctransform::vst() for feature selection.

GLM-PCA:

Preprint: https://www.biorxiv.org/content/10.1101/574574v1

Paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6

Git: https://github.com/willtownes/scrna2019

CRAN: https://cran.r-project.org/web/packages/glmpca/index.html

modified 2 days ago • written 8 months ago by ATpoint
igor wrote:

I saw the GLM-PCA benefits. I believe that there are at least some scenarios where it does perform better. However, does it actually uncover new biological insights? Many single-cell methods make significant improvements on some metrics and look impressive on paper, but very few would actually change the conclusions that were based on classic techniques.

Personal anecdote: I tried not normalizing the data at all and expected completely nonsensical results. However, the major populations still clearly segregated.

written 8 months ago by igor

"Personal anecdote: I tried not normalizing the data at all and expected completely nonsensical results. However, the major populations still clearly segregated."

That is an interesting observation indeed. Have you tried it with n > 1 to see if it is widely applicable?

written 8 months ago by genomax

I have not experimented much with it. I've been meaning to run a more comprehensive analysis, but more pressing tasks get in the way.

written 8 months ago by igor

My expectation is that you'd see fairly substantial sample-to-sample effects with no normalization at all, but I would be interested in seeing whether that's actually true.

written 8 months ago by jared.andrews07

That may be true. I normally see sample-to-sample effects regardless of normalization (without some sort of batch-correction methods like CCA/MNN/etc).

written 8 months ago by igor

I think it also depends on the sample. Normal PBMCs are fairly consistent between samples without batch correction when run through standard pipelines, assuming they're processed fairly close to each other in time by the same person. Disease samples are a different story though.

written 8 months ago by jared.andrews07

Agreed. High-quality healthy samples processed the same way tend to be fairly consistent.

modified 8 months ago • written 8 months ago by igor
will.townes wrote:

Hi, thanks for your interest in GLM-PCA (I'm one of the authors). First of all, GLM-PCA is a dimension reduction method meant to be as similar to PCA as possible, but using a count-based likelihood (or loss function) instead of the implicit normal-distribution likelihood of PCA. Since you seem to be mostly interested in feature selection (i.e. identifying highly informative genes), I encourage you to check out our R package scry (soon to be submitted to Bioconductor), which includes feature selection based on deviance as an alternative to the more traditional "highly variable genes" approach. As you mention, it operates on raw UMI counts, so there is no need for normalization, and a recent comparison by an independent research group found that it performs well against competing methods. The scry package also includes a null residuals transformation (similar to the sctransform method from Hafemeister et al.) whose output can be fed directly to traditional PCA instead of normalized counts. The null residuals are basically a rough approximation to GLM-PCA that is much faster to compute. Alternatively, if you have another normalization/dimension reduction scheme in mind, you can just use the deviance feature selection to choose, say, the top 2,000 genes and then do whatever you like with those. As a side note, we are actively working to improve the scalability and numerical stability of the GLM-PCA optimization routine, so stay tuned for those updates in the future.
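To make the deviance idea concrete, here is a minimal sketch in plain Python. It is not the scry code and not its API. It assumes a binomial null model in which each gene takes up the same proportion of the UMIs in every cell; genes that deviate strongly from that null get a high deviance and rank as informative. The function names (cell_totals, binomial_deviance, top_genes_by_deviance, pearson_null_residuals) and the toy matrix are hypothetical and exist only for this illustration.

```python
import math


def cell_totals(counts):
    """Total UMI count per cell for a genes x cells list of lists."""
    n_cells = len(counts[0])
    return [sum(gene[i] for gene in counts) for i in range(n_cells)]


def binomial_deviance(counts):
    """Per-gene binomial deviance against a constant-proportion null.

    Assumes the matrix contains at least one non-zero count.
    """
    totals = cell_totals(counts)
    grand_total = sum(totals)
    deviances = []
    for gene in counts:
        pi = sum(gene) / grand_total  # pooled proportion of this gene
        d = 0.0
        for y, n in zip(gene, totals):
            # Terms with a zero count contribute nothing (0 * log 0 := 0).
            if y > 0:
                d += y * math.log(y / (n * pi))
            if n - y > 0:
                d += (n - y) * math.log((n - y) / (n * (1.0 - pi)))
        deviances.append(2.0 * d)
    return deviances


def top_genes_by_deviance(counts, k):
    """Row indices of the k genes with the highest deviance."""
    dev = binomial_deviance(counts)
    order = sorted(range(len(dev)), key=lambda j: dev[j], reverse=True)
    return order[:k]


def pearson_null_residuals(counts):
    """Binomial Pearson residuals under the same null (genes x cells)."""
    totals = cell_totals(counts)
    grand_total = sum(totals)
    residuals = []
    for gene in counts:
        pi = sum(gene) / grand_total
        row = []
        for y, n in zip(gene, totals):
            var = n * pi * (1.0 - pi)  # binomial variance under the null
            row.append((y - n * pi) / math.sqrt(var) if var > 0 else 0.0)
        residuals.append(row)
    return residuals


# Hypothetical toy data: 4 genes x 4 cells of raw UMI counts.
toy_counts = [
    [10, 10, 10, 10],  # gene 0: flat across cells
    [0, 0, 20, 20],    # gene 1: expressed only in cells 2 and 3
    [20, 20, 0, 0],    # gene 2: expressed only in cells 0 and 1
    [5, 5, 5, 5],      # gene 3: flat across cells
]

if __name__ == "__main__":
    print(binomial_deviance(toy_counts))
    print(top_genes_by_deviance(toy_counts, 2))
```

On this toy matrix the two flat genes fit the null and get a deviance of essentially zero, while the two genes restricted to a subset of cells rank at the top. The residual matrix from pearson_null_residuals could then be passed to an ordinary PCA, in the spirit of the null residuals described above; for real data, the scry functions are the ones to use.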

written 6 months ago by will.townes

Thanks will.townes for the pointer to the scry package. Will try.

modified 2 days ago • written 4 months ago by ATpoint