I am working on creating a signature matrix from a single-cell RNA-seq dataset. To clarify, my input matrix has the following format (genes in rows, cells in columns):
          AAACGGGAGATCCCGC.1  AAATGCCTCCAATGGT.1  AACCATGTCAGTGCAT.1  AACCATGTCTGTACGA.1  AACCGCGTCCGCATAA.1
LGALSL    0                   0                   0                   0                   0
CD247     0                   0                   0                   0                   1
XKR6      0                   5                   0                   1                   2
KLHL23    0                   4                   0                   0                   0
MTHFSD    0                   0                   0                   0                   1
KHK       0                   0                   0                   0                   0
TCERG1    0                   2                   0                   1                   1
DNAJA3    1                   0                   0                   1                   0
TRAPPC3L  0                   0                   0                   0                   0
PAAF1     0                   1                   0                   1                   1
I have 20,000 rows (genes) and 30,000 columns (cells pooled from different individuals). I want to create a list of differentially expressed genes for each cell type, or formally a signature matrix. The final goal is to estimate cell type proportions.
Tools such as CIBERSORTx and DWLS do the job, and they internally create signature matrices (which include fewer genes compared to the original input). These tools become very slow when the input file is large.
Are there any other ways to create a signature matrix? In other words, for each cell type, is there any quick way of identifying a list of genes?
Here is my potential solution, which still needs improvement. DWLS uses a two-step procedure. In the first step, it selects genes that pass a chosen fold-change threshold. In the second step, it excludes more genes based on p-values generated by MAST. This second step is the time-consuming one. Is there a replacement for the second step that would speed it up?
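To make the idea concrete, here is a minimal sketch of the kind of replacement I have in mind. It is not what DWLS does: it keeps the fold-change filter as the first step but swaps MAST for a vectorized Wilcoxon rank-sum (Mann-Whitney U) test from SciPy, applied only to the genes that survive the first step. The function names (`select_marker_genes`, `build_signature_matrix`), the thresholds, the pseudocount, and the Bonferroni correction are my own assumptions for illustration, and the counts are used without normalization.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def select_marker_genes(counts, labels, cell_type,
                        min_log2_fc=1.0, alpha=0.01, pseudocount=1.0):
    """Return row indices of candidate marker genes for one cell type.

    counts: genes x cells array; labels: one cell type label per column.
    Hypothetical helper: thresholds are illustrative, not DWLS defaults.
    """
    counts = np.asarray(counts, dtype=float)
    in_group = np.asarray(labels) == cell_type

    # Step 1: cheap fold-change filter on mean expression.
    mean_in = counts[:, in_group].mean(axis=1)
    mean_out = counts[:, ~in_group].mean(axis=1)
    log2_fc = np.log2(mean_in + pseudocount) - np.log2(mean_out + pseudocount)
    candidates = np.flatnonzero(log2_fc >= min_log2_fc)
    if candidates.size == 0:
        return candidates

    # Step 2 (assumed replacement for MAST): one-sided rank-sum test,
    # run for all surviving genes at once along the cell axis.
    sub = counts[candidates]
    result = mannwhitneyu(sub[:, in_group], sub[:, ~in_group],
                          alternative="greater", method="asymptotic", axis=1)

    # Bonferroni correction over the genes that were actually tested.
    keep = result.pvalue * candidates.size <= alpha
    return candidates[keep]


def build_signature_matrix(counts, labels, **kwargs):
    """Mean expression per cell type, restricted to the selected genes."""
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    cell_types = np.unique(labels)
    markers = [select_marker_genes(counts, labels, ct, **kwargs)
               for ct in cell_types]
    genes = np.unique(np.concatenate(markers))
    signature = np.column_stack(
        [counts[genes][:, labels == ct].mean(axis=1) for ct in cell_types])
    return genes, cell_types, signature
```

With a genes x cells count array and a vector of cell type labels, `build_signature_matrix(counts, labels)` returns the selected gene indices, the cell type names, and a genes x cell-types matrix of mean expression. Because the test only runs on genes that pass the fold-change filter, the cost of the second step scales with the number of candidates rather than with all 20,000 genes.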