How to create a signature matrix from a given single-cell RNA-seq dataset
5 weeks ago

I am working on creating a signature matrix based on an input (a given single cell RNA-seq dataset). To clarify, my input matrix has the following format:

         AAACGGGAGATCCCGC.1 AAATGCCTCCAATGGT.1 AACCATGTCAGTGCAT.1 AACCATGTCTGTACGA.1 AACCGCGTCCGCATAA.1
LGALSL                    0                  0                  0                  0                  0
CD247                     0                  0                  0                  0                  1
XKR6                      0                  5                  0                  1                  2
KLHL23                    0                  4                  0                  0                  0
MTHFSD                    0                  0                  0                  0                  1
KHK                       0                  0                  0                  0                  0
TCERG1                    0                  2                  0                  1                  1
DNAJA3                    1                  0                  0                  1                  0
TRAPPC3L                  0                  0                  0                  0                  0
PAAF1                     0                  1                  0                  1                  1


I have 20,000 genes (rows) and 30,000 cells (columns), pooled from different individuals. I want to create a list of differentially expressed genes, or more formally a signature matrix. The final goal is to estimate cell type proportions.

Tools such as CIBERSORTx and DWLS do the job, and they internally create signature matrices (which include fewer genes compared to the original input). These tools become very slow when the input file is large.

Are there any other ways to create a signature matrix? In other words, for each cell type, is there any quick way of identifying a list of genes?

Here is my potential solution, which I would like to improve. DWLS uses a two-step procedure. In the first step, it selects genes based on a chosen fold-change threshold; in the second step, it excludes further genes based on p-values generated by MAST. The second step is time-consuming. Is there a faster replacement for it?
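
To make the first (fold-change) step concrete, here is a minimal sketch in plain Python on toy data. It is not DWLS's actual code: the function name `fold_change_markers`, the pseudocount, the threshold, the counts and the cell-type labels are all assumptions made up for illustration, and raw counts are used without normalization for brevity.

```python
import math

def fold_change_markers(counts, labels, min_log2_fc=1.0, pseudocount=1.0):
    """Select candidate marker genes per cell type by log2 fold change.

    counts: dict mapping gene -> list of counts, one per cell
    labels: list of cell-type labels, one per cell (same order as counts)
    Returns a dict mapping cell type -> list of (gene, log2_fc),
    sorted by decreasing fold change.
    """
    cell_types = sorted(set(labels))
    markers = {ct: [] for ct in cell_types}
    for gene, values in counts.items():
        for ct in cell_types:
            inside = [v for v, lab in zip(values, labels) if lab == ct]
            outside = [v for v, lab in zip(values, labels) if lab != ct]
            if not outside:
                continue  # only one cell type present, nothing to compare
            mean_in = sum(inside) / len(inside)
            mean_out = sum(outside) / len(outside)
            # The pseudocount avoids division by zero for unexpressed genes.
            log2_fc = math.log2((mean_in + pseudocount) / (mean_out + pseudocount))
            if log2_fc >= min_log2_fc:
                markers[ct].append((gene, log2_fc))
    for ct in markers:
        markers[ct].sort(key=lambda pair: -pair[1])
    return markers

# Toy data (assumed): three genes, six cells, two made-up cell types.
counts = {
    "CD247": [5, 7, 6, 0, 1, 0],
    "XKR6":  [0, 1, 0, 4, 6, 5],
    "PAAF1": [1, 1, 2, 1, 2, 1],
}
labels = ["T", "T", "T", "B", "B", "B"]
markers = fold_change_markers(counts, labels)
```

This only covers the cheap filtering step; the p-value step is what I am trying to speed up.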


What do you mean by signature matrix? Cell type or state-specific markers? An example would be helpful here.


Thanks for your comment. I updated my question.


Hi Ali,

If by signature matrix, you mean something like this:

Then Seurat could be a good place to start. Check these out:

https://satijalab.org/seurat/articles/install.html

https://satijalab.org/seurat/articles/get_started.html

https://satijalab.org/seurat/articles/pbmc3k_tutorial.html

If you're worried about speed, check out Python packages like Scanpy.

Both Seurat and Scanpy have options (e.g. parallelization) to make them run faster.


Many thanks for the info. I updated my original question with more information.

5 weeks ago

Depending on the information you have, there are a number of ways to do this. If you already know the cell type of each cell in question, whether that be through manual annotation or automated methods, then this is quite easy via typical marker finding (e.g. FindMarkers() from Seurat or findMarkers() from scran, etc). This will allow you to generate a list of marker genes for each cell type that you can slap into a matrix format quite easily, maybe removing those that pop up in multiple cell types depending on how stringent you want to be, the downstream analyses planned, etc.
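
As a rough illustration of the "slap into a matrix" step, here is a minimal sketch in plain Python. It assumes you already have per-cell-type marker lists and per-cell-type mean expression from whatever marker-finding tool you used; the function name `build_signature_matrix`, the `drop_shared` flag, the gene names and the values are all hypothetical.

```python
def build_signature_matrix(mean_expr, markers, drop_shared=False):
    """Assemble a genes x cell-types matrix of mean expression.

    mean_expr: dict mapping cell type -> dict of gene -> mean expression
    markers:   dict mapping cell type -> list of marker genes
    drop_shared: if True, keep only genes that are markers of exactly
                 one cell type (the more stringent option)
    Returns (genes, cell_types, rows), where rows[i][j] is the mean
    expression of genes[i] in cell_types[j].
    """
    cell_types = sorted(markers)
    # Count in how many cell types each gene was reported as a marker.
    n_types = {}
    for ct in cell_types:
        for gene in set(markers[ct]):
            n_types[gene] = n_types.get(gene, 0) + 1
    genes = sorted(g for g, n in n_types.items() if not drop_shared or n == 1)
    rows = [[mean_expr[ct].get(g, 0.0) for ct in cell_types] for g in genes]
    return genes, cell_types, rows

# Toy inputs (assumed): GENE_C is a marker of both NK and T cells.
markers = {
    "B":  ["GENE_A", "GENE_D"],
    "NK": ["GENE_B", "GENE_C"],
    "T":  ["GENE_C", "GENE_E"],
}
mean_expr = {
    "B":  {"GENE_A": 8.0, "GENE_B": 0.1, "GENE_C": 0.2, "GENE_D": 6.0, "GENE_E": 0.3},
    "NK": {"GENE_A": 0.2, "GENE_B": 7.0, "GENE_C": 5.0, "GENE_D": 0.1, "GENE_E": 0.4},
    "T":  {"GENE_A": 0.1, "GENE_B": 0.3, "GENE_C": 4.5, "GENE_D": 0.2, "GENE_E": 9.0},
}

genes_all, cell_types, sig_all = build_signature_matrix(mean_expr, markers)
genes_strict, _, sig_strict = build_signature_matrix(mean_expr, markers, drop_shared=True)
```

With `drop_shared=True` the shared gene is removed and the matrix shrinks by one row; whether that is desirable depends on how stringent you want to be.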

If you don't know the cell types in your sample(s), then you will likely want to annotate them. While a bit of manual annotation using your expert biological knowledge of the samples in question is nearly unavoidable, automated methods that use reference datasets (like SingleR) to assign a cell type to each cell based on similarity to the reference dataset are usually a very useful starting point and much less subjective. Of course, this requires that you have a good reference dataset, but there are many in both celldex and the scRNAseq package that span a wide variety of tissues and cell types. Once you have your annotations in place, you can proceed with marker detection as previously described.

If your end goal is differential abundance analyses based on the cell type proportions, then this section of the OSCA book may be helpful. Sampling bias is an obvious concern, so multiple replicate samples should be a priority.
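
If you already have per-cell annotations, the observed cell type proportions per sample are just normalized counts. A minimal sketch in plain Python, assuming one sample ID and one cell-type label per cell (the function name `cell_type_proportions` and the toy labels are made up):

```python
from collections import Counter, defaultdict

def cell_type_proportions(sample_ids, cell_types):
    """Return a dict mapping sample -> {cell type: fraction of that sample's cells}."""
    counts = defaultdict(Counter)
    for sample, cell_type in zip(sample_ids, cell_types):
        counts[sample][cell_type] += 1
    all_types = sorted(set(cell_types))
    return {
        sample: {ct: c[ct] / sum(c.values()) for ct in all_types}
        for sample, c in counts.items()
    }

# Toy annotations (assumed): two samples with four cells each.
sample_ids = ["s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2"]
cell_types = ["T", "T", "B", "NK", "T", "B", "B", "B"]
props = cell_type_proportions(sample_ids, cell_types)
```

Cell types absent from a sample come out as 0.0 rather than being dropped, which keeps the per-sample vectors comparable across replicates.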


Many thanks. This is what I was looking for. I already have cell type annotations. I just did not get the last comment. You mention if I want to be stringent, I can remove genes appearing in multiple cell types. But I thought I should keep genes common across all (or most) cell types. Do you have any comment on this part specifically? What approach is best to narrow down the list of genes?


What is your goal downstream of this? If your goal is to come up with cell-type specific signatures, why would you want to retain the most prevalent markers that are found in multiple cell types?

Removing common marker genes just increases specificity of those remaining, but it depends on how granular your annotations are. If you have CD4 Th1 and CD4 Th2 cells in your dataset, the markers for those two cell types will likely be very similar to one another, as typical marker finding compares one group to all others. In such cases, removing common genes may leave you with very few markers remaining. How stringent you can or may want to be will depend on your data and what you plan to do with this signature matrix.
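
To show how stringency plays out with closely related cell types, here is a small sketch in plain Python with a tunable cutoff. The function name `filter_markers`, the `max_types` parameter and the marker lists are hypothetical; the gene names are placeholders, not real Th1/Th2 markers.

```python
from collections import Counter

def filter_markers(markers, max_types=1):
    """Keep only genes that are markers of at most `max_types` cell types.

    markers: dict mapping cell type -> list of marker genes
    max_types=1 is the strictest setting (cell-type-unique genes only).
    """
    n_types = Counter(g for genes in markers.values() for g in set(genes))
    return {
        ct: [g for g in genes if n_types[g] <= max_types]
        for ct, genes in markers.items()
    }

# Toy marker lists (assumed): the two CD4 subsets share most of their markers.
markers = {
    "CD4 Th1": ["G1", "G2", "G3", "G4"],
    "CD4 Th2": ["G1", "G2", "G3", "G5"],
    "B":       ["G6", "G7", "G8"],
}

strict = filter_markers(markers, max_types=1)
relaxed = filter_markers(markers, max_types=2)
```

Here the strict setting leaves a single gene for each CD4 subset while the B cell list is untouched, and relaxing the cutoff to two cell types keeps everything.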


The goal is to come up with a cell-type based signature matrix, as you mentioned. I now better understand your point. Many thanks for your help.