Question

How to select the nfeatures parameter of the SelectIntegrationFeatures function in Seurat

2

2.2 years ago

rohitsatyam102 ▴ 850

Hi everyone. I am working with a parasite that has ~5000 genes. The scRNA-seq data come from a stage of its life cycle in which few genes are expressed (the median number of genes detected per cell is ~100 in our data).

In that case, does it make sense to use the default nfeatures value of 2000 when running SelectIntegrationFeatures?
Also, the UMAP is shown below (we have data from two time points, with a 1 hr difference between T1 and T2). Should I perform integration or not?
Finally, since there is no pronounced batch effect, should I still use SCTransform or follow the pbmc3K workflow?

I am not getting many DE genes either way (with or without integration), and the two sets of DE genes overlap considerably.

[Figure: UMAP of the T1 and T2 samples]

scrnaseq Seurat • 2.5k views

0

2.2 years ago

alan.ocallaghan • 0

I can't really advise you on any of the Seurat-specific settings (i.e., whether to use sctransform or follow the workflow). However, for the nfeatures question, I'd say there are two reasons to exclude genes. The first is computational: if you have 20k genes and 20k cells, it's obviously a computational nightmare, which is why Seurat workflows keep only about 2k genes. The second is the same reason people often use PCA before clustering: denoising. By removing very lowly expressed genes, you're removing features where the signal-to-noise ratio is near 1:1, and it's hard to tell whether these features add any value.

I'd be inclined in situations like this to plot the distribution of mean expression across all genes, and to identify a lower bound of this distribution, below which you feel there's not any useful information being added. This will probably result in a dataset with relatively few genes, so I'd check the scTransform paper for how many genes they recommend to stably estimate their regression coefficients. I'd also recommend trying the Lause method for analytic Pearson residuals, as it won't have the same issue.
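To make the two suggestions above concrete, here is a minimal sketch in plain Python on a made-up toy matrix. It first drops genes whose mean count falls below a cutoff, then computes analytic Pearson residuals as I understand the Lause et al. formulation (negative binomial null model, residuals clipped at the square root of the number of cells). The function names, the toy counts, the `min_mean` cutoff of 0.5, and `theta = 100` are all assumptions for illustration; this is not Seurat or sctransform code.

```python
import math

def gene_means(counts):
    """Mean count per gene. `counts` is a list of cells, each a list of per-gene counts."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    return [sum(cell[g] for cell in counts) / n_cells for g in range(n_genes)]

def filter_genes(counts, min_mean):
    """Keep only genes whose mean count is at least `min_mean` (hypothetical cutoff)."""
    means = gene_means(counts)
    keep = [g for g, m in enumerate(means) if m >= min_mean]
    filtered = [[cell[g] for g in keep] for cell in counts]
    return keep, filtered

def analytic_pearson_residuals(counts, theta=100.0):
    """Pearson residuals under a negative binomial null with fixed overdispersion theta."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    cell_totals = [sum(cell) for cell in counts]
    gene_totals = [sum(cell[g] for cell in counts) for g in range(n_genes)]
    total = sum(cell_totals)
    clip = math.sqrt(n_cells)  # clipping bound on the residuals
    residuals = []
    for cell, c_tot in zip(counts, cell_totals):
        row = []
        for x, g_tot in zip(cell, gene_totals):
            mu = c_tot * g_tot / total  # expected count under the null model
            if mu == 0:
                row.append(0.0)  # empty cell or empty gene: no information
                continue
            r = (x - mu) / math.sqrt(mu + mu * mu / theta)
            row.append(max(-clip, min(clip, r)))
        residuals.append(row)
    return residuals

# Toy data: 4 cells x 4 genes (made up for illustration).
toy_counts = [
    [10, 0, 3, 0],
    [12, 0, 1, 0],
    [0, 1, 4, 0],
    [9, 0, 2, 0],
]
kept_genes, filtered_counts = filter_genes(toy_counts, min_mean=0.5)
toy_residuals = analytic_pearson_residuals(filtered_counts)
print(kept_genes)
```

On real data, the cutoff would be read off the plotted distribution of gene means rather than fixed in advance.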

Of course, you could just subset to all genes with at least one non-zero count across all cells. However, this often leaves you with a lot of features that add basically no useful information. Furthermore, if you do that, you have to be VERY careful not to scale the data to have equal variance: if you do, the genes that represent the main sources of variation in the data will be given the same weight as genes with one count in one cell and zero in all others.
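The scaling problem can be seen on two made-up genes. In this small sketch (toy values and names are assumptions, not output from any real dataset), one gene separates two groups of cells and the other has a single count in a single cell. Before scaling, their variances differ by several orders of magnitude; after z-scoring, both have variance 1 and would contribute equally to a downstream PCA.

```python
import statistics

def zscore(values):
    """Center to mean 0 and scale to unit (population) variance."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)  # constant gene: nothing to scale
    return [(v - mean) / sd for v in values]

# Toy expression across 10 cells (made up for illustration).
informative = [0, 0, 0, 0, 0, 20, 22, 19, 21, 18]  # separates two groups of cells
singleton = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]         # one count in one cell

raw_var_informative = statistics.pvariance(informative)
raw_var_singleton = statistics.pvariance(singleton)

scaled_var_informative = statistics.pvariance(zscore(informative))
scaled_var_singleton = statistics.pvariance(zscore(singleton))

print(raw_var_informative, raw_var_singleton)
print(scaled_var_informative, scaled_var_singleton)
```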

One general criticism I have of the Seurat workflows is they really don't seem to discard genes very often. I've yet to see an scRNAseq dataset where keeping the full ~20k genes for human samples proves useful, when often the vast majority of them have no signal in them.

0

"This will probably result in a dataset with relatively few genes, so I'd check the scTransform paper for how many genes they recommend to stably estimate their regression coefficients."

I read the sctransform paper, but I could not find a recommendation on how many genes are appropriate. Do you remember one from the paper?

Reply • 2.1 years ago by rohitsatyam102 ▴ 850

0

I tried reproducing the figures from the sctransform paper as you suggested, but couldn't conclude anything.

[Figure: plots reproduced from the sctransform paper]

Reply • 2.1 years ago by rohitsatyam102 ▴ 850

score 2 · Accepted Answer · 2022-03-07

2

2.1 years ago

jared.andrews07 ★ 16k

I'd probably just try it with a few different numbers and see how each one does (e.g., 2000, 1000, 500, 250, 100). It's hard to answer your second question without just slapping both timepoints in the same plot. I'd probably still try integration.
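A sweep like this can be sketched outside Seurat to see what larger values start pulling in. The example below is plain Python on made-up counts: it ranks genes by a simple variance-to-mean ratio (a stand-in chosen for illustration, not how SelectIntegrationFeatures ranks features) and, for each candidate size, reports how many genes were actually selected and how many of those are detected in fewer than `min_cells` cells. The gene names, counts, and the `min_cells = 3` cutoff are all hypothetical. In Seurat itself, each candidate would be passed as the `nfeatures` argument, and the resulting embeddings and clusters compared.

```python
def dispersion(values):
    """Variance-to-mean ratio; 0 for genes with no counts."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return var / mean

def top_n_genes(gene_counts, n):
    """Names of the n genes with the highest dispersion (fewer if n exceeds the gene count)."""
    ranked = sorted(gene_counts, key=lambda g: dispersion(gene_counts[g]), reverse=True)
    return ranked[:n]

def sweep(gene_counts, candidates, min_cells=3):
    """For each candidate n, return (genes selected, selected genes seen in < min_cells cells)."""
    report = {}
    for n in candidates:
        selected = top_n_genes(gene_counts, n)
        sparse = sum(
            1 for g in selected
            if sum(1 for v in gene_counts[g] if v > 0) < min_cells
        )
        report[n] = (len(selected), sparse)
    return report

# Toy data: 6 genes x 8 cells (made up for illustration).
toy_genes = {
    "g_hi": [0, 0, 0, 0, 9, 11, 10, 12],
    "g_mid": [1, 0, 2, 0, 4, 5, 3, 6],
    "g_flat": [2, 2, 2, 2, 2, 2, 2, 2],
    "g_one": [0, 0, 0, 1, 0, 0, 0, 0],
    "g_two": [0, 1, 0, 0, 0, 0, 1, 0],
    "g_zero": [0, 0, 0, 0, 0, 0, 0, 0],
}
sweep_report = sweep(toy_genes, candidates=[2, 4, 10])
print(sweep_report)
```

In this toy case, asking for more features than carry signal only adds near-empty genes, and asking for more than exist simply returns every gene.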
