Question: Single-cell RNA-seq outlier identification: how can I find the subset of genes that are expressed in a common set of samples (cells)? (Fluidigm SINGuLAR package)
4.6 years ago by
gaelgarcia150
UK
gaelgarcia150 wrote:

I have a matrix of single-cell gene expression, containing gene names and cell (sample) IDs.

The SINGuLAR package manual includes an outlier identification function (identifyOutliers) that does the following, in their own words:

Outlier analysis is based on the assumption that samples (cells) of the same type also have a set of commonly-expressed genes.

The outlier algorithm iteratively trims the low-expressing genes in an expression file until 95% of the genes that remain are expressed above the Limit of Detection (LoD) value that you set for half of the samples.

The assumption is that the set of samples contains less than 50% outliers. This means that subsequent calculations will only include the half of the samples that have the highest expression for the trimmed gene list.

The trimmed gene list represents genes that are present above the LoD in at least half the samples or the most evenly expressed genes—though they might not be the highest or lowest in their expression value.

For the 50% of the samples that remain, a distribution is calculated that represents their combined expression values for the gene list defined above. For this distribution, the median represents the 50th percentile expression value for the set of data.

I am trying to find this list of genes that is expressed in at least half of the cells. I assume they mean that the genes in this list must ALL be expressed in the same subset of cells, i.e., these genes must co-express in a subset of at least 48 cells.

I built the following matrix, in which each row is a logical vector indicating the samples in which a gene was detected. A gene must be detected in at least 48 of the 96 samples (half) to be included, i.e., every gene in this matrix appears in 48 or more samples.

```
       Sample1  Sample2  Sample3  Sample4  Sample5  Sample6  Sample7  Sample8  ...
gene1  TRUE     FALSE    TRUE     TRUE     TRUE     FALSE    FALSE    FALSE
gene2  FALSE    TRUE     FALSE    TRUE     FALSE    TRUE     TRUE     FALSE
gene3  TRUE     TRUE     FALSE    TRUE     FALSE    TRUE     TRUE     FALSE
gene4  FALSE    FALSE    TRUE     FALSE    TRUE     FALSE    TRUE     TRUE
gene5  TRUE     TRUE     TRUE     TRUE     TRUE     FALSE    TRUE     TRUE
gene6  FALSE    FALSE    TRUE     FALSE    FALSE    TRUE     TRUE     TRUE
gene7  TRUE     TRUE     FALSE    FALSE    TRUE     TRUE     FALSE    FALSE
gene8  TRUE     TRUE     TRUE     TRUE     FALSE    FALSE    FALSE    FALSE
```
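
For reference, a detection matrix like the one above could be derived from a numeric expression matrix roughly as follows. This is only a sketch: the toy matrix `expr`, the threshold `lod`, and the variable names are assumptions made for illustration, not values from the real data or from the SINGuLAR package.

```r
# Hypothetical toy expression matrix: genes in rows, samples in columns.
set.seed(1)
expr <- matrix(rexp(8 * 8, rate = 0.5), nrow = 8,
               dimnames = list(paste0("gene", 1:8), paste0("Sample", 1:8)))

lod <- 1  # assumed Limit of Detection; pick the value appropriate for your assay

# Logical matrix: TRUE where a gene is expressed above the LoD in a sample.
detected <- expr > lod

# Keep only genes detected in at least half of the samples
# (this would be 48 of 96 in the real data set).
min_samples <- ncol(detected) / 2
detected <- detected[rowSums(detected) >= min_samples, , drop = FALSE]
```

The `drop = FALSE` keeps the result a matrix even if only one gene passes the filter.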

Equivalently, for each gene I have a character vector of the samples in which that gene was expressed, such as:

```
> gene1
[1] "Sample1"  "Sample3"  "Sample4"  "Sample5"
```

How can I obtain the *largest* set of genes (rows in the matrix) that belong to a common set of *at least* 48 samples (columns)? (Assuming my interpretation of the function is correct).

subset rna-seq matrix R single-cell • 2.6k views
modified 4.6 years ago by Antonio R. Franco4.1k • written 4.6 years ago by gaelgarcia150
4.6 years ago by
Asaf6.1k
Israel
Asaf6.1k wrote:

What you're trying to do is bi-clustering. There are tools like SAMBA that can find regions of the matrix that "look alike". I don't know if you can force a minimum of 4 columns, but some tool might have this option. There is a review that compares several tools here: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3447720/

Thanks, Asaf. I have updated my question to better reflect what I am trying to do with the data. I am not sure if this is biclustering?

Oh, single cells.

I still think biclustering is relevant. However, if you have 8 samples you can count the genes for each of the 8 choose 4 (70) combinations of samples and find the 4 samples with the highest number of genes that are TRUE in all of them.
I didn't quite get what you're trying to achieve here, but I have a feeling that probability can make life much easier. Try to figure out the appropriate distribution of the gene expression values under your assumption and see whether the data agree.
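
The brute-force count over sample combinations suggested above might be sketched like this, using the 8 x 8 example matrix from the question. The variable names are made up for illustration, and this only scales to small inputs: the number of combinations grows far too quickly to enumerate 48 out of 96 samples.

```r
# Example detection matrix copied from the question (genes x samples).
detected <- matrix(c(
  TRUE,  FALSE, TRUE,  TRUE,  TRUE,  FALSE, FALSE, FALSE,
  FALSE, TRUE,  FALSE, TRUE,  FALSE, TRUE,  TRUE,  FALSE,
  TRUE,  TRUE,  FALSE, TRUE,  FALSE, TRUE,  TRUE,  FALSE,
  FALSE, FALSE, TRUE,  FALSE, TRUE,  FALSE, TRUE,  TRUE,
  TRUE,  TRUE,  TRUE,  TRUE,  TRUE,  FALSE, TRUE,  TRUE,
  FALSE, FALSE, TRUE,  FALSE, FALSE, TRUE,  TRUE,  TRUE,
  TRUE,  TRUE,  FALSE, FALSE, TRUE,  TRUE,  FALSE, FALSE,
  TRUE,  TRUE,  TRUE,  TRUE,  FALSE, FALSE, FALSE, FALSE
), nrow = 8, byrow = TRUE,
dimnames = list(paste0("gene", 1:8), paste0("Sample", 1:8)))

k <- 4  # number of samples per combination (half of 8)

# Each column of `combos` holds the indices of one choice of k samples.
combos <- combn(ncol(detected), k)

# For every combination, count the genes that are TRUE in all k samples.
n_common <- apply(combos, 2, function(idx) {
  sum(rowSums(detected[, idx, drop = FALSE]) == k)
})

# Take the first combination that reaches the maximum count.
best <- combos[, which.max(n_common)]
best_samples <- colnames(detected)[best]
best_genes <- rownames(detected)[rowSums(detected[, best, drop = FALSE]) == k]
```

Note that `which.max` returns only the first maximum, so ties between combinations would need to be inspected separately, for example with `which(n_common == max(n_common))`.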

Thanks. The problem with the 8 choose 4 combinations is that the set of samples expressing the same set of genes can contain 4 samples or more. The subset of genes must be expressed in at least half of the cells, and these must be the same cells for every gene. This is the part that confuses me: how to determine this subset of cells and genes.
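
One observation may help with the "4 or more" concern: any gene set that is expressed in a common set of more than 4 samples is also expressed in every 4-sample subset of that set, so searching combinations of exactly the minimum size cannot miss the largest gene set. The full set of shared samples can then be recovered afterwards as every column in which all of the chosen genes are TRUE. A minimal sketch, where `common_samples` is a hypothetical helper name and the three rows are copied from the example matrix in the question:

```r
# Three rows taken from the example detection matrix in the question.
detected <- rbind(
  gene2 = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,  TRUE, FALSE),
  gene3 = c(TRUE,  TRUE, FALSE, TRUE, FALSE, TRUE,  TRUE, FALSE),
  gene5 = c(TRUE,  TRUE, TRUE,  TRUE, TRUE,  FALSE, TRUE, TRUE)
)
colnames(detected) <- paste0("Sample", 1:8)

# Hypothetical helper: names of the samples in which ALL given genes are detected.
common_samples <- function(detected, genes) {
  sub <- detected[genes, , drop = FALSE]
  colnames(detected)[colSums(sub) == length(genes)]
}

shared_two <- common_samples(detected, c("gene2", "gene3"))
shared_three <- common_samples(detected, c("gene2", "gene3", "gene5"))
```

Here `shared_two` contains the four samples in which both gene2 and gene3 are detected, and adding gene5 shrinks the shared set to three samples, which illustrates how each extra gene can only keep or reduce the common sample set.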

4.6 years ago by
Antonio R. Franco4.1k wrote:

I wonder if you can import this table into R. Then you can count the TRUE values in each column by summing, since in R TRUE has a value of 1 and FALSE a value of 0.

Something like apply(nameoftable, 2, sum) should work: margin 2 sums over columns (samples), whereas apply(nameoftable, 1, sum) gives the count per row (gene).
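
As a small illustration of counting TRUE values in a logical matrix, base R's `colSums` and `rowSums` give the per-sample and per-gene counts directly. The two rows below are the first four columns of gene1 and gene2 from the question; the variable names are only for illustration.

```r
# First four columns of gene1 and gene2 from the example in the question.
detected <- rbind(
  gene1 = c(TRUE,  FALSE, TRUE,  TRUE),
  gene2 = c(FALSE, TRUE,  FALSE, TRUE)
)
colnames(detected) <- paste0("Sample", 1:4)

# TRUE counts as 1 and FALSE as 0 when summed.
per_sample <- colSums(detected)  # number of detected genes in each sample
per_gene   <- rowSums(detected)  # number of samples in which each gene is detected

# The apply() equivalents: MARGIN = 2 is columns, MARGIN = 1 is rows.
per_sample_apply <- apply(detected, 2, sum)
per_gene_apply   <- apply(detected, 1, sum)
```

`per_gene >= ncol(detected) / 2` would then flag the genes detected in at least half of the samples.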