I'm new to this forum. I'm doing single-cell RNA-seq on one KO dataset (KO) and two CT datasets (CT1 and CT2), and I'm wondering if there is a way to tell whether KO is more similar to CT1 or CT2. I did some searching but couldn't find anything similar to what I want to do; maybe I didn't search for the right term.
You should also keep in mind that since you have only one KO sample, you cannot distinguish between the biological and technical variability that would come from profiling KO samples. I.e. for the two control samples that you expect to be very similar, you can gauge the variability introduced by the sampling procedure and individual experiments by simply comparing them to each other; with the KO sample you don't have that information.
If you had multiple KO samples, you could generate "pseudo-bulk" samples, e.g. sum up all reads per gene across all cells of a given sample, and do PCA at a more bird's-eye-view level. However, Jautis is right in pointing out that individual subsets of cells may be more similar to each other across sample types than others. E.g. if this is a sample from a solid tumor, there may be benign cells in the mix, which I would expect to be more similar to the controls than the actual tumor cells. But then again, we don't know the details of the experiment here; I just wanted to underscore Jautis' point that there's not really one ready-made solution or score to address your question, and you may have to take a step back and figure out what exactly your data contains and how that relates to the insights you want to derive.
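To make the pseudo-bulk idea concrete, here is a minimal sketch in plain Python. The count matrix, the sample labels, and the helper names (`pseudobulk`, `log_cpm`, `pearson`) are all made up for illustration. It compares samples with a pairwise Pearson correlation on log-CPM values rather than PCA, just to stay dependency-free, and with a single KO sample it still says nothing about replicate variability.

```python
import math
from collections import defaultdict

def pseudobulk(counts, sample_of_cell):
    """Sum per-gene counts over all cells that belong to each sample."""
    n_genes = len(counts[0])
    totals = defaultdict(lambda: [0] * n_genes)
    for cell, sample in zip(counts, sample_of_cell):
        row = totals[sample]
        for g, c in enumerate(cell):
            row[g] += c
    return dict(totals)

def log_cpm(vec):
    """Library-size normalise to counts per million, then log2(x + 1)."""
    total = sum(vec)
    return [math.log2(c / total * 1e6 + 1) for c in vec]

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Toy data (hypothetical): 6 cells x 4 genes, two cells per sample.
counts = [
    [10, 0, 5, 1],   # KO
    [8, 1, 4, 0],    # KO
    [9, 0, 6, 1],    # CT1
    [11, 1, 5, 0],   # CT1
    [1, 9, 0, 7],    # CT2
    [0, 12, 1, 6],   # CT2
]
labels = ["KO", "KO", "CT1", "CT1", "CT2", "CT2"]

bulk = pseudobulk(counts, labels)
norm = {sample: log_cpm(vec) for sample, vec in bulk.items()}

r_ct1 = pearson(norm["KO"], norm["CT1"])
r_ct2 = pearson(norm["KO"], norm["CT2"])
print("KO vs CT1:", round(r_ct1, 3))
print("KO vs CT2:", round(r_ct2, 3))
```

On real data you would probably do this per cell type or cluster as well, since a whole-sample sum hides exactly the subset effects mentioned above.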
Hi Jautis, thanks for the reply. By "similar" I mean something like a score, i.e. if KO scores higher against CT1 than against CT2, that would mean CT1 and KO share more similarities. The clustering map looks the same for all of them.
Your idea of using UMAP or tSNE for the three samples sounds good too! Do you know if it is possible to do that for single-cell RNA-seq samples though? I've done it for RNA-seq but never for single-cell RNA-seq.
No, there's no simple score. That's simply because similarity can be defined in so many different ways (distance from the centroid, nearest cellular neighbors, differentially expressed genes, effect size distributions, etc.), and these can't be collapsed into a single "similarity" statistic. You can, however, combine lines of evidence if the differences are small or point to obvious structural differences.
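As a rough illustration of why the definitions can disagree, here is a small sketch of two of them: distance between sample centroids, and the sample composition of each KO cell's nearest control neighbors. The 2-D coordinates stand in for something like the first two principal components and are invented, as are the function names and the choice of k=3.

```python
import math
from collections import Counter

def centroid(cells):
    """Mean position of a group of cells in the embedding."""
    n = len(cells)
    return [sum(col) / n for col in zip(*cells)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neighbor_fractions(query_cells, ref_cells, ref_labels, k=3):
    """For each query cell, find its k nearest reference cells and
    report which share of all those neighbors comes from each sample."""
    votes = Counter()
    for q in query_cells:
        ranked = sorted(range(len(ref_cells)),
                        key=lambda i: euclidean(q, ref_cells[i]))
        for i in ranked[:k]:
            votes[ref_labels[i]] += 1
    total = sum(votes.values())
    return {label: n / total for label, n in votes.items()}

# Hypothetical embedding coordinates: two KO cells sit near CT1, one near CT2.
ko = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1]]
ct1 = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.1]]
ct2 = [[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]

d_ct1 = euclidean(centroid(ko), centroid(ct1))
d_ct2 = euclidean(centroid(ko), centroid(ct2))

fractions = neighbor_fractions(ko, ct1 + ct2,
                               ["CT1"] * len(ct1) + ["CT2"] * len(ct2))

print("centroid distance to CT1:", round(d_ct1, 2))
print("centroid distance to CT2:", round(d_ct2, 2))
print("neighbor fractions:", fractions)
```

In this toy case the centroid distance favors CT1 overall, while the neighbor view shows that a third of the KO cells actually sit with CT2. That is the kind of subset structure a single number would hide.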
And yes, you can do UMAP/tSNE with cells from multiple samples. You just need to add the sample to the metadata and then color by that as opposed to cell cluster or another variable. This should be easy to implement in Seurat if you have a single Seurat object with all the data (note, you don't want to integrate across samples). And I'm pretty sure you would have had no reason to use UMAP on normal RNA-seq data -- unless you had thousands of samples, there's nothing there that would warrant the additional dimensionality reduction and clustering as opposed to normal PCA.
Ah yes, I meant I've done PCA for RNA-seq, not UMAP/tSNE. Thanks for the clarification!
The two are very different forms of dimension reduction and visualization! You can do a PCA on single-cell data, but it usually looks really weird. UMAP projections typically cluster cell types together unless something weird is happening in the dataset.