I have a matrix of single-cell gene expression, containing gene names and cell (sample) IDs.
The SINGuLAR package includes an outlier identification function (identifyOutliers), which the manual describes as follows, in their own words:
Outlier analysis is based on the assumption that samples (cells) of the same type also have a set of commonly-expressed genes.
The outlier algorithm iteratively trims the low-expressing genes in an expression file until 95% of the genes that remain are expressed above the Limit of Detection (LoD) value that you set for half of the samples.
The assumption is that the set of samples contains less than 50% outliers. This means that subsequent calculations will only include the half of the samples that have the highest expression for the trimmed gene list.
The trimmed gene list represents genes that are present above the LoD in at least half the samples or the most evenly expressed genes—though they might not be the highest or lowest in their expression value.
For the 50% of the samples that remain, a distribution is calculated that represents their combined expression values for the gene list defined above. For this distribution, the median represents the 50th percentile expression value for the set of data.
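To make my reading of the trimming step concrete, here is a minimal Python/NumPy sketch. It is not SINGuLAR's code: the function name `trim_genes` is made up, and ranking "low-expressing" genes by their mean expression across samples is my own assumption, since the manual does not say how genes are ranked.

```python
import numpy as np

def trim_genes(expr, lod, gene_frac=0.95, sample_frac=0.5):
    """Hypothetical sketch of the trimming step (not SINGuLAR's implementation).

    expr: genes x samples array of expression values.
    Returns the row indices of the genes that survive trimming.
    """
    expr = np.asarray(expr, dtype=float)
    n_samples = expr.shape[1]
    need = int(np.ceil(sample_frac * n_samples))  # e.g. 48 of 96 samples

    # ASSUMPTION: genes are trimmed in order of increasing mean expression.
    kept = list(np.argsort(expr.mean(axis=1)))    # lowest-expressing first

    while kept:
        # Number of samples in which each remaining gene is above the LoD.
        detected = (expr[kept] > lod).sum(axis=1)
        # Stop once 95% of the remaining genes pass in at least half the samples.
        if (detected >= need).mean() >= gene_frac:
            break
        kept.pop(0)                               # drop the lowest-expressing gene
    return sorted(int(i) for i in kept)
```

Under this reading, genes are dropped one at a time until the 95% criterion holds, so a few of the surviving genes may still be detected in fewer than half the samples.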
I am trying to find this list of genes that are expressed in at least half of the cells. I assume they mean that the genes in this list must ALL be expressed in the same subset of cells, i.e., with my 96 cells, these genes must be co-expressed in a common subset of at least 48 cells.
I built the following matrix: each row is a logical vector indicating the samples in which a gene was detected. Only genes detected in at least 48 of the 96 samples (half) are included, i.e., every gene in this matrix appears in 48 or more samples.

          Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 ...
    gene1    TRUE   FALSE    TRUE    TRUE    TRUE   FALSE   FALSE   FALSE
    gene2   FALSE    TRUE   FALSE    TRUE   FALSE    TRUE    TRUE   FALSE
    gene3    TRUE    TRUE   FALSE    TRUE   FALSE    TRUE    TRUE   FALSE
    gene4   FALSE   FALSE    TRUE   FALSE    TRUE   FALSE    TRUE    TRUE
    gene5    TRUE    TRUE    TRUE    TRUE    TRUE   FALSE    TRUE    TRUE
    gene6   FALSE   FALSE    TRUE   FALSE   FALSE    TRUE    TRUE    TRUE
    gene7    TRUE    TRUE   FALSE   FALSE    TRUE    TRUE   FALSE   FALSE
    gene8    TRUE    TRUE    TRUE    TRUE   FALSE   FALSE   FALSE   FALSE

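For reference, a matrix like this can be built roughly as in the Python/NumPy sketch below. The function name `detection_matrix` is made up for illustration, and I am assuming "detected" means strictly above the LoD.

```python
import numpy as np

def detection_matrix(expr, lod):
    """Illustrative helper (hypothetical name).

    expr: genes x samples array of expression values.
    Returns (detected, keep), where `keep` is a boolean mask over the input
    genes and `detected` holds the logical rows of the kept genes only.
    """
    expr = np.asarray(expr, dtype=float)
    detected = expr > lod                    # ASSUMPTION: detected = above the LoD
    min_samples = -(-expr.shape[1] // 2)     # ceil(n / 2): 48 for 96 samples
    keep = detected.sum(axis=1) >= min_samples
    return detected[keep], keep
```

The mask `keep` can be used to recover the gene names of the rows that made it into the matrix.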
Equivalently, for each gene I have a vector of the samples in which that gene was detected, such as:
> gene1  "Sample1" "Sample3" "Sample4" "Sample5"
Assuming my interpretation of the function is correct, how can I obtain the *largest* set of genes (rows in the matrix) that are all detected in a common set of *at least* 48 samples (columns)?
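To pin down what I am asking for, here is a naive greedy sketch in Python/NumPy. The function name `greedy_gene_set` is made up, and this is only a heuristic: it always adds the gene that leaves the most shared samples, so it is not guaranteed to find the largest possible gene set, which is exactly the part I need help with.

```python
import numpy as np

def greedy_gene_set(detected, min_samples):
    """Greedy heuristic (hypothetical helper, NOT guaranteed optimal).

    detected: genes x samples boolean matrix.
    Returns (gene_indices, sample_indices) such that every chosen gene is
    detected in every returned sample.
    """
    detected = np.asarray(detected, dtype=bool)
    n_genes, n_samples = detected.shape
    common = np.ones(n_samples, dtype=bool)  # samples shared by all chosen genes
    chosen = []
    remaining = set(range(n_genes))

    while remaining:
        # Gene whose addition keeps the most shared samples (ties: lowest index).
        best = max(remaining,
                   key=lambda g: (int((common & detected[g]).sum()), -g))
        new_common = common & detected[best]
        if new_common.sum() < min_samples:
            break                            # no remaining gene can be added
        chosen.append(best)
        common = new_common
        remaining.remove(best)
    return sorted(chosen), [int(i) for i in np.flatnonzero(common)]
```

On my data I would call it with `min_samples=48`. If no gene can be added, it returns an empty gene list together with all samples.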