Question

scRNA-seq analysis of two data-sets

4.6 years ago

piyushjo ▴ 700

Hi,

I have two datasets from two different but related cell lines: pre- and post-relapse cancer cell lines from the same patient.

I have performed single-cell sequencing on both, with the hypothesis that I will be able to find rare cell populations in either line that are key to relapse.

To do that, I am following two protocols:

1) I am following the Seurat tutorial on integrating stimulated and control PBMC datasets

https://satijalab.org/seurat/v3.1/immune_alignment.html

2) I am following this second tutorial, where no integration steps are involved.

https://davetang.org/muse/2018/01/24/merging-two-10x-single-cell-datasets/

However, these two approaches give different results. With 1), most cells from the two lines appear similar, with very few cells from each being different. With 2), I get the opposite, yet expected, result: most of the cells are different, with only a few in common.

Which one is the correct way of analyzing the data? Are the two approaches giving me different results because they are doing different things, i.e. 1) is finding common genes and 2) is finding distinct genes?

seurat scRNA-seq • 5.0k views

score 3 · Accepted Answer · 2019-09-05

4.6 years ago

jared.andrews07 ★ 16k

These are very different approaches. The first tries to account for technical variation between the two sets, mapping similar cells between the two to each other. The second literally just merges the columns and rows of the two sets together into one object - it's not doing any special normalization. It's just a straight merging of data.
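To make the "straight merging" idea concrete, here is a minimal toy sketch in plain Python. It is not what Seurat does internally; the `merge_counts` helper, the sample labels, and the gene and barcode names are all made up for illustration. It only shows the general idea: cells are concatenated, barcodes are made unique, and the raw counts are left untouched.

```python
# Toy illustration only: plain dicts stand in for count matrices.
# Each dataset maps gene -> {cell barcode -> raw count}.

def merge_counts(datasets):
    """Concatenate cells from several datasets without any correction.

    `datasets` maps a sample label to a {gene: {barcode: count}} dict.
    Barcodes are prefixed with the sample label so they stay unique,
    and genes missing from a sample are filled with zero counts.
    """
    all_genes = sorted({g for counts in datasets.values() for g in counts})
    all_cells = []
    for label, counts in datasets.items():
        barcodes = sorted({bc for per_gene in counts.values() for bc in per_gene})
        all_cells.extend((label, bc) for bc in barcodes)

    merged = {}
    for gene in all_genes:
        merged[gene] = {
            f"{label}_{bc}": datasets[label].get(gene, {}).get(bc, 0)
            for label, bc in all_cells
        }
    return merged


# Hypothetical samples; note the same barcode "AAAC" appears in both.
pre = {"GENE_A": {"AAAC": 5, "AAAG": 0}, "GENE_B": {"AAAC": 1, "AAAG": 2}}
post = {"GENE_A": {"AAAC": 7}, "GENE_C": {"AAAC": 3}}

merged = merge_counts({"pre": pre, "post": post})
for gene, row in merged.items():
    print(gene, row)
```

Nothing in this sketch looks at how similar the cells are to each other, which is the key contrast with an integration step.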

It's tough for us to say which is more appropriate - the first may help you to identify cell populations that truly differentiate the two, but it could be blowing away real differences due to how Seurat's integration works. It will force populations that aren't similar together if there aren't many overlapping cell types between the two samples. The second may be revealing significant technical variation or batch effects, or it could just be that your cell lines are quite different from each other. You are in the best position to determine if this is the case or not - we know nothing about your samples.

This is where the true difficulty of RNA-seq analysis lies - nobody is really going to be able to tell which is truly correct.

You might try other integration methods if you feel you have batch effects or technical variation that needs to be addressed. I've found the SeuratWrappers wrapper around fastMNN to be quite good, personally, as it handles cases where samples don't have much overlap in terms of cell types much more appropriately.

OK, I understand your clarification, but what would you suggest when I really want to find something that connects two sets that I know for sure are different overall? For example, if I have a differentiating neuronal culture sampled at Day 1 and Day 2, and I am really interested in finding the transient population in Day 1 that becomes the dominant population in Day 2, which "merging" method would be most beneficial?

I will also perform trajectory analysis after the basic Seurat workflow for marker finding and visualization.

Reply • 4.6 years ago by piyushjo ▴ 700

In such a case as that, if I wasn't worried about batch effects and my starting cells were the same, I'd just straight merge the data. I don't know much about neuronal differentiation, but I expect that there'd still be significant overlap. Being different from each other is fine - you'd just expect that they aren't completely different after such a short time.

If you can avoid integration, do so. Unless you know you have confounding batch or other technical effects, there is no reason to complicate your analysis, and it may even "correct" out real biological differences.
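As a toy illustration of how a correction step can remove real biology, here is a sketch using a deliberately crude "correction" (subtracting each sample's own mean). This is not Seurat's anchor-based integration, and the `center_per_batch` helper and the expression values are invented for the example. The point is only that when sample and biology are completely confounded, a method has no way to tell a batch effect from a real shift.

```python
from statistics import mean


def center_per_batch(values_by_batch):
    """Subtract each batch's own mean: a deliberately crude 'correction'."""
    return {
        batch: [v - mean(values) for v in values]
        for batch, values in values_by_batch.items()
    }


# Hypothetical log-expression of one gene in two samples.
# Assume the shift between them is real biology, not a technical artifact.
expr = {"day1": [1.0, 2.0, 3.0], "day2": [7.0, 8.0, 9.0]}

before = mean(expr["day2"]) - mean(expr["day1"])
corrected = center_per_batch(expr)
after = mean(corrected["day2"]) - mean(corrected["day1"])

print(f"difference before correction: {before:.1f}")
print(f"difference after correction:  {after:.1f}")
```

Real integration methods are far more careful than this, but the failure mode is the same in spirit: the less the samples overlap, the more a correction has to guess.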

Reply • 4.6 years ago by jared.andrews07 ★ 16k

Thanks for your insight!

Reply • 4.6 years ago by piyushjo ▴ 700

Why is it better to avoid integration as a whole? When should one use integration then? My analysis with and without integration produces very similar results. Does this give me confidence in the integration analysis or does it suggest that it is not necessary? I am comparing a control and treatment population, and like the original author, am interested in how the populations change and whether one population turns into another following treatment.

Reply • 3.3 years ago by aa123 • 0

Why is it better to avoid integration as a whole?

It is generally a pain and can sometimes obfuscate truly unique populations by smacking them together. It's not a perfect process, though improvements have been made since I initially posted this, and benchmarking studies now show which methods perform relatively well. For differential expression, you should use the raw counts and include the batch (the source of bias) as a covariate in your model anyway. In short, integration can at times yield misleading UMAP/tSNE plots.
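A rough sketch of why accounting for batch in the model matters, rather than comparing pooled or "corrected" values. This is a toy stratified comparison in plain Python, not the count-based model a real differential expression tool would fit with batch as a covariate. The numbers and the `naive_effect` and `blocked_effect` helpers are made up, and the design is deliberately unbalanced so the pooled comparison is misleading.

```python
from statistics import mean

# Hypothetical uncorrected expression of one gene: (batch, condition, value).
# batch2 runs higher overall, and most treated cells happen to be in batch2.
cells = [
    ("batch1", "control", 1.0), ("batch1", "control", 1.0),
    ("batch1", "control", 1.0), ("batch1", "treated", 3.0),
    ("batch2", "control", 5.0), ("batch2", "treated", 7.0),
    ("batch2", "treated", 7.0), ("batch2", "treated", 7.0),
]


def naive_effect(cells):
    """Treated minus control, ignoring batch entirely."""
    treated = [v for _, cond, v in cells if cond == "treated"]
    control = [v for _, cond, v in cells if cond == "control"]
    return mean(treated) - mean(control)


def blocked_effect(cells):
    """Average the treated-minus-control difference within each batch.

    Assumes every batch contains both conditions.
    """
    batches = sorted({b for b, _, _ in cells})
    per_batch = [
        naive_effect([c for c in cells if c[0] == batch]) for batch in batches
    ]
    return mean(per_batch)


print("pooled estimate: ", naive_effect(cells))
print("blocked estimate:", blocked_effect(cells))
```

Within each batch the treatment shifts the gene by 2, which the blocked estimate recovers, while the pooled estimate is inflated by the batch offset.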

When should one use integration then?

When technical batch effects are evident and obviously skew your dimensionality reduction, thereby making plots difficult to interpret and clearly biased. Samples prepped and run on the same day are unlikely to have such brazen technical biases.