Find clusters using marker genes
3 months ago
rmf ★ 1.4k

The normal workflow is to cluster cells and then find marker genes for those clusters. Is it possible to do the reverse? I have marker genes. I want to find cell clusters using those genes. I have marker lists like this:

cell marker
A    geneA
A    geneH
A    geneK
B    geneB
B    geneT
C    geneD
C    geneS
C    geneW
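
For downstream tools, a named list with one vector of genes per cell type is usually a more convenient shape than the two-column table. A minimal base R sketch, assuming the table has been read into a data frame with columns `cell` and `marker` (the object names `markers` and `signatures` are placeholders, not from any package):

```r
# Toy marker table in the same two-column layout as above
markers <- data.frame(
  cell   = c("A", "A", "A", "B", "B", "C", "C", "C"),
  marker = c("geneA", "geneH", "geneK", "geneB", "geneT",
             "geneD", "geneS", "geneW"),
  stringsAsFactors = FALSE
)

# One character vector of marker genes per cell type
signatures <- split(markers$marker, markers$cell)
str(signatures)
```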
seurat single-cell cell-typing sc-rnaseq • 339 views
3 months ago
rmf ★ 1.4k

So, it turns out this question was asked previously, and the answer there took me in a completely wrong direction, towards cell deconvolution. What I actually needed was "automatic marker-assisted cell type identification".

I found it hard to find relevant tools because most automated cell type identification tools (1, 2) are fully automatic and estimate cell types from a reference dataset rather than from marker genes. One of the papers has a table showing that very few tools use prior knowledge.

One tool that I found was scSorter, but it was buggy, took forever to run (1-2 hours for 600 genes in total and 50 markers), and the results were not that great. SCINA worked and was fast (a few minutes for 600 genes in total and 50 markers). Other options could be MACA and Garnett in Monocle, which I haven't tried yet.
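
To make the idea of marker-assisted labelling concrete, here is a toy base R sketch: score every cell against each marker set and assign the best-scoring type, or "unknown" if nothing scores well. This is a deliberately simple stand-in and not how SCINA or scSorter work internally. The expression matrix is simulated, and `score_cells`, `label_cells`, and the `min_score` cutoff are hypothetical names and values chosen only for illustration.

```r
set.seed(1)

# Hypothetical signatures: one vector of marker genes per cell type
signatures <- list(
  A = c("geneA", "geneH", "geneK"),
  B = c("geneB", "geneT"),
  C = c("geneD", "geneS", "geneW")
)

# Simulated log-normalised expression: genes in rows, cells in columns
genes <- c(unlist(signatures, use.names = FALSE), paste0("other", 1:12))
truth <- rep(names(signatures), each = 20)
expr <- matrix(rnorm(length(genes) * length(truth), mean = 1, sd = 0.3),
               nrow = length(genes),
               dimnames = list(genes, paste0("cell", seq_along(truth))))
# Raise each type's own markers so the toy data carries some signal
for (type in names(signatures)) {
  own <- signatures[[type]]
  expr[own, truth == type] <- expr[own, truth == type] + 2
}

# Score = mean z-scored expression of each signature, per cell
score_cells <- function(expr, signatures) {
  z <- t(scale(t(expr)))  # z-score each gene across cells
  sapply(signatures, function(g) {
    colMeans(z[intersect(g, rownames(z)), , drop = FALSE])
  })                      # result: cells x cell types
}

# Label = best-scoring type, or "unknown" when no score clears the cutoff
label_cells <- function(scores, min_score = 0.5) {
  best <- colnames(scores)[max.col(scores, ties.method = "first")]
  best[apply(scores, 1, max) < min_score] <- "unknown"
  best
}

scores <- score_cells(expr, signatures)
labels <- label_cells(scores)
table(labels, truth)
```

The number of groups is never specified here: it follows from the number of marker sets, plus an "unknown" bin for cells that match none of them.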

3 months ago
theHumanBorch ▴ 150

Hey rmf,

Most single-cell workflows have a feature-selection step. For example, Seurat has FindVariableFeatures():

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

You could theoretically input your marker genes into the PCA calculation and downstream analysis.

pbmc <- RunPCA(pbmc, features = gene.list$marker)

I have tested this a little when subclustering specific cell lineages, without much success. It turns the pipeline from semi-supervised into fully supervised, and I think it would require a lot of prior knowledge to produce coherent and meaningful clusters. Note also that this does not use the marker genes directly to find clusters: it computes the principal components (eigenvectors) from the marker-gene expression, and those are then used for clustering and UMAP/t-SNE generation.


I see what you mean, but I am thinking of separating DR and clustering into two independent steps. Variable genes should still drive the DR/UMAP, and only the clustering should be based on marker genes. Perhaps the NN graph used for clustering should be built from the marker genes. But either way, I would still need to define K groups, and I want to avoid that. The groups must be identified from the marker genes that I provide.
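
For reference, the split described above (variable genes for the embedding, marker genes for the neighbour graph) could be sketched in Seurat roughly as follows. This is an untested outline, not a recommended recipe. `cluster_on_markers`, the reduction name "pca_markers", and the graph names are made up for illustration. It assumes a normalised Seurat object with variable features already selected, enough cells and features for the default 50 PCs, and at least three marker genes present in the data. It still needs a `resolution` value and returns unlabelled clusters, so it does not remove the need to choose the granularity.

```r
# Sketch only: requires the Seurat package when the function is called.
# `obj` is a normalised Seurat object; `marker_genes` is a character vector.
cluster_on_markers <- function(obj, marker_genes, resolution = 0.8) {
  marker_genes <- intersect(unique(marker_genes), rownames(obj))
  stopifnot(length(marker_genes) >= 3)
  n_pcs <- min(length(marker_genes) - 1, 20)
  var_genes <- Seurat::VariableFeatures(obj)

  # Scale both gene sets so that RunPCA can find all of them
  obj <- Seurat::ScaleData(obj, features = union(var_genes, marker_genes))

  # Embedding for display: PCA and UMAP on the variable genes
  obj <- Seurat::RunPCA(obj, features = var_genes, verbose = FALSE)
  obj <- Seurat::RunUMAP(obj, reduction = "pca", dims = 1:10)

  # Second PCA on the marker genes only, stored under its own name
  obj <- Seurat::RunPCA(obj, features = marker_genes, npcs = n_pcs,
                        approx = FALSE, reduction.name = "pca_markers",
                        reduction.key = "PCM_", verbose = FALSE)

  # Neighbour graph and clustering driven by the marker-gene PCs
  obj <- Seurat::FindNeighbors(obj, reduction = "pca_markers",
                               dims = 1:n_pcs,
                               graph.name = c("marker_nn", "marker_snn"))
  Seurat::FindClusters(obj, graph.name = "marker_snn",
                       resolution = resolution)
}
```

Usage would be along the lines of `pbmc <- cluster_on_markers(pbmc, gene.list$marker)`, after which the clusters can be shown on the variable-gene UMAP with `DimPlot(pbmc, reduction = "umap")`.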

