I have an scRNA-seq dataset, and I want to look at the proportional variance between "samples" or even different datasets, batches, and so on.

Some people do batch-effect correction and then show a bar plot of "percent variance explained by batch".

I want to do something similar: find the percent of variance explained by different conditions and comparisons.

My protocol so far is similar to finding R² in linear regression:

- Step 1: SStotal = sum of squares over all samples (around the overall mean)
- Step 2: SSwithin = sum of squares within samples (around each group's own mean)
- Step 3: % variance explained = (SStotal - SSwithin) / SStotal, or something similar.
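For reference, here is how I understand the standard one-way ANOVA decomposition for a single gene. This is a sketch in my own notation, not something from a specific protocol: $g$ indexes the groups (e.g. batches), $n_g$ is the number of samples in group $g$, $\bar{x}_g$ is the group mean and $\bar{x}$ is the overall mean.

$$
\begin{aligned}
SS_{\text{total}} &= \sum_{g} \sum_{i \in g} (x_i - \bar{x})^2 \\
SS_{\text{within}} &= \sum_{g} \sum_{i \in g} (x_i - \bar{x}_g)^2 \\
SS_{\text{between}} &= \sum_{g} n_g\,(\bar{x}_g - \bar{x})^2 \\
SS_{\text{total}} &= SS_{\text{between}} + SS_{\text{within}} \\
R^2 &= \frac{SS_{\text{between}}}{SS_{\text{total}}} = 1 - \frac{SS_{\text{within}}}{SS_{\text{total}}}
\end{aligned}
$$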

The problem is that for each sample there are 20,000 genes, each with its own variance. So how do I estimate the **total** variance across all genes for a group of samples?

I know how to do it for **one** gene: this is simply sum( (mean - xi)² ), where xi is the expression of the gene in sample i. But since there are many genes, each has its own variance. How do I calculate the total sample variance for all genes?

The simplest would be to sum them, but this would skew the variance for a few outlier samples with high expression / variance. What is the standard way to estimate group variance in batch correction or similar situations?
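To make the two options concrete, here is a minimal sketch of what I have in mind. It is not taken from ComBat, limma, or the paper linked further down. The function name `variance_explained_by_group` and the `standardize` flag are my own, and I assume a samples x genes matrix of normalized (e.g. log-transformed) expression values. It sums the between-group and total sums of squares over genes, and optionally z-scores each gene first so that high-variance genes cannot dominate the total.

```python
import numpy as np

def variance_explained_by_group(X, groups, standardize=False):
    """Fraction of variance explained by a grouping (e.g. batch).

    X          : array of shape (n_samples, n_genes)
    groups     : one label per sample
    standardize: if True, z-score each gene first (constant genes are dropped)
    Returns (per_gene_r2, overall_r2).
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)

    if standardize:
        sd = X.std(axis=0)
        keep = sd > 0  # drop genes with zero variance
        X = (X[:, keep] - X[:, keep].mean(axis=0)) / sd[keep]

    grand_mean = X.mean(axis=0)
    # Total sum of squares per gene, around the overall mean
    ss_total = ((X - grand_mean) ** 2).sum(axis=0)

    # Between-group sum of squares per gene
    ss_between = np.zeros(X.shape[1])
    for g in np.unique(groups):
        Xg = X[groups == g]
        ss_between += Xg.shape[0] * (Xg.mean(axis=0) - grand_mean) ** 2

    # Per-gene R^2; genes with zero total variance get 0
    per_gene_r2 = np.divide(
        ss_between, ss_total,
        out=np.zeros_like(ss_total), where=ss_total > 0,
    )
    # Pooled R^2: sum over genes first, then take the ratio
    overall_r2 = ss_between.sum() / ss_total.sum()
    return per_gene_r2, overall_r2
```

Without standardization, the pooled value is a weighted average of the per-gene R², with weights proportional to each gene's total sum of squares, so a few high-variance genes can dominate. With `standardize=True`, every retained gene has the same total sum of squares, so the pooled value equals the plain mean of the per-gene R². I don't know which of the two counts as standard.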

Can you link a reference for this?

I don't have any references / protocols. I'm just taking inspiration from how % explained variance is calculated in PCA and linear regression.

I was referring to that sentence. What you probably mean is the % variance explained by each principal component, no?

No. See e.g. here:

https://www.biorxiv.org/content/biorxiv/early/2020/10/28/2020.10.27.358283/F3.large.jpg

https://www.biorxiv.org/content/10.1101/2020.10.27.358283v1.full

% explained variance by batch is generally mentioned in papers on batch correction.

"We first considered the proportion of variance explained by treatment and batch effects before and after batch correction across all variables using pRDA. Efficient batch correction methods should generate data with a smaller proportion of batch associated variance and larger proportion of treatment variance compared to the original data."

It seems they are using limma's removeBatchEffect and ComBat. ComBat returns % explained variance by batch, but I don't understand how they calculate the total variance, because they first calculate the variance for each gene separately.