One of the methods used to find malignant cells in tumor scRNA-seq data is infercnv. This, this, and this paper used infercnv outputs to identify cancer cells among other epithelial cells.

They separate cancer and non-cancer cells based on 2 thresholds: CNV score and CNV correlation. The CNV score for each cell is computed as the mean of the squares of the residual expression across genes. The CNV correlation, however, is computed as the correlation between the CNV profile of each cell and the average CNV profile of all cells from the corresponding tumor, except for those classified by gene expression as non-malignant.

I am not sure how to calculate the latter. What is a "CNV profile"? And how do I compute "the average CNV profile of all cells from the corresponding tumor, except for those classified by gene expression as non-malignant"?

I was directed to this issue on their GitHub, but no one there clarified how to compute the CNV correlation.

Thank you. Just to be sure, by "InferCNV gives a CNV score per individual cell." you mean `infercnv_obj@expr.data`, which is a cell-by-gene matrix, right? So I'll need to compute a score for each cell as the mean of squares across all genes?

Sorry - CNV score per gene per cell; so you have a `(n_cell, n_gene)` matrix of residual scores (or transposed). The "mean CNV score" would be the row means.

Thank you. This is what I've done:

While the CNV correlation makes sense, the CNV score looks almost the same for all cells. What am I doing wrong?

So:

(1) inferCNV should provide both an "expected" expression value and a "residual" value. It's the residual value that's used to actually "call" CNVs (a run of many genes with high residuals would implicate an amplification, and a run with low residuals would implicate a deletion).

(2) The sum of the residuals across genes within cells would be kind of an estimate of total burden, but you're better off using the calls themselves for this since there will be lots of noise from the many genes with small residuals.

(3) The cancer CNV vector and the cell correlations should be built from scaled residuals.

(4) I'm assuming you're rescaling to [-1, 1] because other publications do so? Otherwise, what is the justification for rescaling the residuals?
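To make the two quantities discussed in this thread concrete, here is a minimal sketch in plain Python. It is not infercnv's API or the exact procedure from the papers. It assumes you have already exported a residual value per gene per cell, that those residuals are centered on 0 (if your values sit around some other baseline, subtract it first), and that you have a set of cell names already classified as non-malignant. The names `cnv_score`, `tumor_profile`, `score_cells`, and the toy data are all hypothetical.

```python
import math


def cnv_score(residuals):
    """Mean of squared residuals across genes for one cell."""
    return sum(r * r for r in residuals) / len(residuals)


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return 0.0  # a constant vector has no defined correlation
    return sum(a * b for a, b in zip(dx, dy)) / denom


def tumor_profile(residuals_by_cell, non_malignant):
    """Per-gene average over all cells of one tumor, skipping non-malignant cells."""
    kept = [p for cell, p in residuals_by_cell.items() if cell not in non_malignant]
    n_genes = len(kept[0])
    return [sum(p[g] for p in kept) / len(kept) for g in range(n_genes)]


def score_cells(residuals_by_cell, non_malignant):
    """Return {cell: (cnv_score, cnv_correlation)} for every cell of one tumor."""
    profile = tumor_profile(residuals_by_cell, non_malignant)
    return {
        cell: (cnv_score(p), pearson(p, profile))
        for cell, p in residuals_by_cell.items()
    }


# Toy data (made up): 3 cells x 4 genes, one cell flagged as non-malignant.
toy = {
    "tumor_a": [0.5, 0.6, -0.4, -0.5],
    "tumor_b": [0.4, 0.5, -0.6, -0.3],
    "normal_a": [0.05, -0.02, 0.01, -0.03],
}
results = score_cells(toy, non_malignant={"normal_a"})
for cell, (score, corr) in results.items():
    print(cell, round(score, 4), round(corr, 3))
```

Here the "CNV profile" of a cell is simply its vector of residuals across genes, and the tumor's average profile is the per-gene mean of those vectors over the cells that were not flagged as non-malignant. Whether a cell should be left out of the average it is compared against is a choice the sketch does not make; it keeps every non-flagged cell.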

I would also recommend saving the raw and residualized expression values throughout the various steps of denoising, as you may find that one set of values outperforms the others.