I have an scRNA-seq dataset, and I want to look at the proportional variance between "samples" or even different datasets, batches, and so on.
Some people do batch-effect correction and then show a bar plot of "percent explained variance by batch".
I want to do something similar, that is, find the percent of variance explained by different conditions and comparisons.
My protocol so far is similar to finding R² in linear regression:
- Step 1: SSbetween = sum of squares across all samples
- Step 2: SSwithin = sum of squares within samples
- Step 3: % variance explained = (SSbetween - SSwithin) / SSbetween, or something similar.
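
For a single gene, the steps above could be sketched roughly as follows. This is only a minimal illustration with made-up numbers; `variance_explained`, `expr`, and `groups` are hypothetical names. It assumes the usual ANOVA decomposition, in which the sum of squares over all samples (Step 1) is normally called the total sum of squares, hence `ss_total` below.

```python
import numpy as np

def variance_explained(expr, groups):
    """Fraction of one gene's sum of squares explained by the grouping."""
    expr = np.asarray(expr, dtype=float)
    groups = np.asarray(groups)
    # Step 1: sum of squares over all samples, around the grand mean
    ss_total = np.sum((expr - expr.mean()) ** 2)
    # Step 2: sum of squares within each group, around that group's mean
    ss_within = sum(
        np.sum((expr[groups == g] - expr[groups == g].mean()) ** 2)
        for g in np.unique(groups)
    )
    # Step 3: share of the total not left over within groups
    return (ss_total - ss_within) / ss_total

# Made-up expression values for one gene in six samples from two batches
expr = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
groups = ["a", "a", "a", "b", "b", "b"]
frac = variance_explained(expr, groups)
print(f"fraction explained by batch: {frac:.3f}")
```

With these toy values the total sum of squares is 58 and the within-batch sum of squares is 4, so the fraction is 54/58.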
The problem is that for each sample there are 20,000 genes, each with its own variance. So how do I estimate the total variance across all genes for a group of samples?
I know how to do it for one gene: it is simply the sum over samples i of (mean - xi)², where xi is the expression of the gene in sample i. But with many genes, each has its own variance. How do I calculate the total sample variance across all genes?
The simplest would be to sum them, but then a few outlier samples with high expression or variance would skew the total. What is the standard way to estimate group variance in batch correction or similar situations?
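
To make the concern concrete, here is a small sketch comparing two ways of combining genes: pooling the sums of squares across genes, versus averaging the per-gene fractions. The data are made up, `ss_per_gene` is a hypothetical helper, and neither option is claimed to be the standard approach; the point is only to show how differently they behave when one gene has much larger variance.

```python
import numpy as np

def ss_per_gene(X, groups):
    """Per-gene total and within-group sums of squares.

    X is a samples x genes matrix; groups has one label per sample.
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    ss_total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    ss_within = np.zeros(X.shape[1])
    for g in np.unique(groups):
        Xg = X[groups == g]
        ss_within += ((Xg - Xg.mean(axis=0)) ** 2).sum(axis=0)
    return ss_total, ss_within

# Toy data: gene 0 has low variance and is fully batch-driven,
# gene 1 has high variance that is unrelated to batch.
X = np.array([
    [1.0,   0.0],
    [1.0, 100.0],
    [3.0,   0.0],
    [3.0, 100.0],
])
groups = ["a", "a", "b", "b"]

ss_total, ss_within = ss_per_gene(X, groups)

# Option A: sum over genes first, then take the ratio
pooled = (ss_total.sum() - ss_within.sum()) / ss_total.sum()

# Option B: ratio per gene, then average (assumes no gene has zero variance)
per_gene = (ss_total - ss_within) / ss_total
mean_per_gene = per_gene.mean()

print(f"pooled: {pooled:.4f}, mean per gene: {mean_per_gene:.2f}")
```

In this toy case the pooled value is close to zero because the high-variance gene dominates the sums, while the per-gene average is 0.5 because both genes count equally.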