I have two datasets from two different yet related cancer cell lines from the same patient: one pre-relapse and one post-relapse.
I have performed single-cell sequencing on both, with the hypothesis that I will be able to find rare cell populations in either line that are key to relapse.
To do that I am following two protocols:
1) I am following the Seurat tutorial on integrating stimulated and control PBMC datasets
However, the two approaches give different results! Using 1), most cells from the two lines look similar, with very few cells from each line that differ. Using 2), I get the opposite, yet expected, result: most of the cells differ, with only a few in common.
Which one is the correct way of analyzing? Are the two approaches giving me different results because they are doing different things? Is 1) finding common genes and 2) finding distinct genes?
These are very different approaches. The first tries to account for technical variation between the two sets, mapping similar cells between the two to each other. The second literally just merges the columns and rows of the two sets together into one object - it's not doing any special normalization. It's just a straight merging of data.
It's tough for us to say which is more appropriate - the first may help you to identify cell populations that truly differentiate the two, but it could be blowing away real differences due to how Seurat's integration works. It will force populations that aren't similar together if there aren't many overlapping cell types between the two samples. The second may be revealing significant technical variation or batch effects, or it could just be that your cell lines are quite different from each other. You are in the best position to determine if this is the case or not - we know nothing about your samples.
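To make the "forcing together" risk concrete, here is a toy sketch in plain Python. It is not what Seurat's integration does: per-sample mean-centering is used as a deliberately crude stand-in for batch correction, and the expression values are made up for illustration.

```python
from statistics import mean

# Hypothetical expression of one gene in two samples that share NO cell type.
# Sample A is all "type X" cells (low expression); sample B is all "type Y" (high).
sample_a = [1.0, 1.2, 0.8, 1.1, 0.9]
sample_b = [5.0, 5.3, 4.7, 5.1, 4.9]

def center(values):
    """Subtract the sample mean: a crude stand-in for batch correction."""
    m = mean(values)
    return [v - m for v in values]

# Plain merge: values are combined untouched, so the gap between samples survives.
gap_merged = mean(sample_b) - mean(sample_a)

# "Corrected": each sample is centred on its own mean before combining,
# which treats the whole difference between samples as a batch effect.
corrected_a = center(sample_a)
corrected_b = center(sample_b)
gap_corrected = mean(corrected_b) - mean(corrected_a)

print(f"gap between samples after plain merge: {gap_merged:.2f}")
print(f"gap between samples after centering:   {gap_corrected:.2f}")
```

Because the two samples have no cell type in common, the correction has nothing to anchor on and removes the entire difference, real or not. Anchor-based and MNN-based methods are far more careful than this, but the failure mode is the same in spirit.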
This is where the true difficulty of RNA-seq analysis lies - nobody is really going to be able to tell which is truly correct.
You might try other integration methods if you feel you have batch effects or technical variation that needs to be addressed. I've found the SeuratWrappers wrapper around fastMNN to be quite good, personally, as it handles cases where samples don't have much overlap in terms of cell types much more appropriately.
OK, I understand your clarification, but what would you suggest when I really want to find something that connects two sets that I know for sure are different overall? For example, if I have a differentiating neuronal culture sampled at Day 1 and Day 2, and I am really interested in finding the transient population at Day 1 that becomes the dominant population at Day 2, which "merging" method would be most beneficial?
I will also perform trajectory analysis after the basic Seurat workflow for finding markers and visualization.
In such a case as that, if I wasn't worried about batch effects and my starting cells were the same, I'd just straight merge the data. I don't know much about neuronal differentiation, but I expect that there'd still be significant overlap. Being different from each other is fine - you'd just expect that they aren't completely different after such a short time.
If you can avoid integration, do so. Unless you know you have confounding batch or other technical effects, there is no reason to complicate your analysis, and it may even "correct" out real biological differences.
Why is it better to avoid integration as a whole? When should one use integration then?
My analysis with and without integration produces very similar results. Does this give me confidence in the integration analysis or does it suggest that it is not necessary? I am comparing a control and treatment population, and like the original author, am interested in how the populations change and whether one population turns into another following treatment.
It is generally a pain and can sometimes obfuscate truly unique populations by smacking them together. It's not a perfect process, though improvements have been made since I initially posted this, and there are now benchmarking studies that show which methods perform relatively well. For differential expression, you should use the raw counts and include the source of bias (e.g. batch) in your model anyway. In short, integration can at times yield misleading UMAP/tSNE plots.
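As a rough illustration of what "include it in your model" means, the sketch below fits expression ~ treatment + batch by ordinary least squares in plain Python. Everything here is an assumption for illustration: the cells and values are made up, the helper names are hypothetical, and real differential expression on raw counts would use a count-based model from a dedicated package rather than OLS.

```python
from statistics import mean

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(design, y):
    """Least-squares coefficients via the normal equations (X^T X) beta = X^T y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical cells as (treated, batch, expression), generated without noise from
# expression = 2.0 + 1.0 * treated + 3.0 * batch
cells = [
    (0, 0, 2.0), (0, 0, 2.0), (0, 0, 2.0), (1, 0, 3.0),  # batch 0: mostly control
    (0, 1, 5.0), (1, 1, 6.0), (1, 1, 6.0), (1, 1, 6.0),  # batch 1: mostly treated
]

# Naive comparison that ignores batch: the batch shift leaks into the estimate.
naive_effect = (mean(e for t, _, e in cells if t == 1)
                - mean(e for t, _, e in cells if t == 0))

# Model with batch as a covariate: columns are intercept, treated, batch.
design = [[1.0, float(t), float(b)] for t, b, _ in cells]
expression = [e for _, _, e in cells]
intercept, treatment_effect, batch_effect = ols(design, expression)

print(f"naive treatment effect:          {naive_effect:.2f}")
print(f"treatment effect with batch term: {treatment_effect:.2f}")
```

The point is that the data are left untouched and the batch is handled by the model, instead of testing on values that an integration step has already altered.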
When should one use integration then?
When technical batch effects are evident and obviously skew your dimensionality reduction, thereby making plots difficult to interpret and clearly biased. Samples prepped and run on the same day are unlikely to have such brazen technical biases.
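If you want something more than eyeballing the plot, one rough check is a nearest-neighbour mixing score: for each cell, the fraction of its k nearest neighbours that come from the same sample. The sketch below is plain Python on made-up 2-D coordinates; the function name and the choice of k are hypothetical. A high score only says the samples separate, not whether the cause is technical or biological, so that judgement is still yours.

```python
from math import dist

def same_sample_fraction(coords, labels, k=2):
    """Mean fraction of each cell's k nearest neighbours that share its sample label."""
    fractions = []
    for i, point in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        nearest = sorted(others, key=lambda j: dist(point, coords[j]))[:k]
        fractions.append(sum(labels[j] == labels[i] for j in nearest) / k)
    return sum(fractions) / len(fractions)

# Hypothetical embedding where the two samples form separate clumps.
separated_coords = [(0, 0), (0, 1), (1, 0), (1, 1),
                    (10, 10), (10, 11), (11, 10), (11, 11)]
separated_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Hypothetical embedding where cells from the two samples alternate along a line.
mixed_coords = [(x, 0) for x in range(8)]
mixed_labels = ["A", "B", "A", "B", "A", "B", "A", "B"]

separated_score = same_sample_fraction(separated_coords, separated_labels)
mixed_score = same_sample_fraction(mixed_coords, mixed_labels)

print(f"separated samples: {separated_score:.3f}")
print(f"well-mixed samples: {mixed_score:.3f}")
```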
Thanks for your insight!
Thank you for circling back to this thread. I appreciate your insight.