Question

Integrating scRNA-seq datasets with different numbers of biological replicates

4.4 years ago

mosquitolzw • 0

Dear all,

I'm looking to integrate scRNA-seq datasets generated from different experiments, some of which have multiple biological replicates while others have only one. I'm thinking of downsampling the experiments with multiple replicates before integration, but I'm not sure about the best approaches for doing so or the caveats that come with each approach.

May I get some advice from you please?

Thank you, Cat

RNA-Seq • 2.3k views

ADD COMMENT • link updated 4.4 years ago by Biological information research group of Harbin m… ▴ 10 • written 4.4 years ago by mosquitolzw • 0

Why do you want to remove data? Are you integrating to remove batch/technical effects?

ADD REPLY • link 4.4 years ago by jared.andrews07 ★ 16k

I'm not integrating to remove batch effects. Rather, I want to run an analysis that integrates datasets obtained from different species, and I don't see how the biological replicates can help here. Let's say I have datasets from 4 mice (biological replicates) and 1 dataset from human. Does it make sense to keep all 4 mouse datasets if they are similar? Wouldn't it be simpler to compare 1 mouse (downsampled) with 1 human sample?

ADD REPLY • link 4.4 years ago by mosquitolzw • 0

I really doubt comparing this way across species is going to work out the way you'd hope. What is your end goal? Are you trying to compare cell types between the species?

Replicates help by increasing your power for differential expression analysis between clusters/cell types, increasing sensitivity for rare cell populations, etc. This is especially true for sparse datasets (like scRNA-seq). There is no benefit to removing them by default. If they are from different labs, have significant technical effects, etc., then removing them or comparing each mouse set individually to your human sample may make sense. It's tough to say without more info on the data or the experimental objectives.

ADD REPLY • link 4.4 years ago by jared.andrews07 ★ 16k

My end goal is to compare the heterogeneity within the same cell type across different species. In terms of sensitivity for rare cell populations, there are other things to consider, such as sampling frequency (e.g. a dataset of cells from the whole organ vs. a dataset of one cell type isolated from that organ) and the different sequencing depths of different experiments. What I have in mind is not really removing data, but generating a representative dataset from the biological replicates to compare with the datasets that have only one replicate. Could you give me some advice on the methodology for this, or do you still think keeping all the replicates is better, even taking computing cost into consideration?

ADD REPLY • link 4.4 years ago by mosquitolzw • 0

If heterogeneity within a cell type is your goal, I don't think removing data is a good move. To clarify, you want to compare heterogeneity in each cell type within either human or mouse? Or between the same cell type for each species? The latter seems rather tricky in my eyes due to different genes, etc.

Downsampling is fine for certain visualizations or quick tests, but otherwise there is rarely a good reason to do so with scRNA-seq data. Saving a few hours isn't really worth the potential loss of valuable information. If you have no batch/technical effects, you can just merge all your replicates together; there's no need to actually do integration.
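
As a rough illustration of downsampling for a plot only, here is a minimal Python sketch that caps the number of cells drawn from each group while the full dataset stays untouched for the actual analysis. The function name `downsample_for_plot`, the cell IDs, and the cell counts are all made up for the example, and the grouping label (species here) is an assumption; it could equally be the sample or the experiment.

```python
import random
from collections import defaultdict

def downsample_for_plot(cell_to_group, max_per_group, seed=0):
    """Return at most max_per_group cell IDs from each group (for plotting only)."""
    rng = random.Random(seed)  # fixed seed so the same cells are drawn every time
    by_group = defaultdict(list)
    for cell_id, group in cell_to_group.items():
        by_group[group].append(cell_id)
    keep = []
    for group in sorted(by_group):
        cells = sorted(by_group[group])
        n = min(max_per_group, len(cells))
        keep.extend(rng.sample(cells, n))  # sample without replacement
    return keep

# Toy input (hypothetical): 4 mouse replicates of 500 cells each, 1 human sample of 300 cells.
cell_to_group = {}
for rep in range(1, 5):
    for i in range(500):
        cell_to_group[f"mouse{rep}_cell{i}"] = "mouse"
for i in range(300):
    cell_to_group[f"human1_cell{i}"] = "human"

plot_cells = downsample_for_plot(cell_to_group, max_per_group=300)
print(len(plot_cells))  # 300 mouse + 300 human = 600
```

Only the plotting step would use `plot_cells`; clustering and differential expression would still run on every cell.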

ADD REPLY • link 4.4 years ago by jared.andrews07 ★ 16k

I want to compare the same cell type across different species, and I'm planning to use HomoloGene to standardise the gene names.
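
To make the gene-name standardisation concrete, here is a minimal Python sketch that keeps only one-to-one orthologs from a homology table. It assumes the table has already been reduced to (group ID, species, gene symbol) rows; the rows, group IDs, and gene symbols below are invented placeholders, not real HomoloGene records, and `one_to_one_orthologs` is a hypothetical helper.

```python
from collections import defaultdict

def one_to_one_orthologs(rows, species_a, species_b):
    """Map species_a symbols to species_b symbols, keeping only homology
    groups that contain exactly one gene from each of the two species."""
    groups = defaultdict(lambda: defaultdict(list))
    for group_id, species, symbol in rows:
        groups[group_id][species].append(symbol)
    mapping = {}
    for members in groups.values():
        a = members.get(species_a, [])
        b = members.get(species_b, [])
        if len(a) == 1 and len(b) == 1:
            mapping[a[0]] = b[0]
    return mapping

# Placeholder homology table (hypothetical group IDs and symbols).
rows = [
    (1, "mouse", "Gene1"), (1, "human", "GENE1"),                           # one-to-one: kept
    (2, "mouse", "Gene2a"), (2, "mouse", "Gene2b"), (2, "human", "GENE2"),  # two-to-one: dropped
    (3, "mouse", "Gene3"),                                                  # no human member: dropped
]
mapping = one_to_one_orthologs(rows, "mouse", "human")

# Rename the mouse genes to their human symbols and drop everything unmapped.
mouse_counts = {"Gene1": 5, "Gene2a": 3, "Gene3": 1}
shared = {mapping[g]: c for g, c in mouse_counts.items() if g in mapping}
print(shared)  # {'GENE1': 5}
```

Dropping the ambiguous groups is the conservative choice; it shrinks the shared gene set, so it is worth checking how many genes survive before comparing species.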

Yes, I was thinking about difficulties in visualisation when the vast majority of points are from one experiment and the replicates are just occupying space. I agree that including all the datasets is better. As for batch correction, unless I know how the samples were processed, I generally assume it is needed. Is that a reasonable approach?

ADD REPLY • link 4.4 years ago by mosquitolzw • 0

That seems tough, as the same gene doesn't always have the same expression profile or function in different species. But I guess that's kind of the point.

If you can avoid batch correction, you should. But yeah, if they're from different labs/datasets, then you'll probably need to integrate them. I recommend fastMNN from the batchelor package. It also has a Seurat wrapper for easy use and is significantly less heavy-handed than Seurat's method (plus it reports how much variation was removed from each batch).
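
For intuition about the "MNN" in fastMNN, here is a toy Python sketch of finding mutual nearest neighbour pairs between two batches: a pair is kept only when each cell is among the other's nearest neighbours in the opposite batch. This is not batchelor's implementation. fastMNN works in a PCA space and goes on to compute correction vectors from the pairs, whereas this sketch uses made-up 2-D coordinates and stops at the pairing step. The names `nearest` and `mnn_pairs` are hypothetical.

```python
import math

def nearest(query, reference, k):
    """For each query point, return the set of indices of its k nearest reference points."""
    result = []
    for q in query:
        order = sorted(range(len(reference)), key=lambda j: math.dist(q, reference[j]))
        result.append(set(order[:k]))
    return result

def mnn_pairs(batch1, batch2, k=1):
    """Return (i, j) pairs where batch1[i] and batch2[j] are in each other's k nearest neighbours."""
    nn12 = nearest(batch1, batch2, k)  # neighbours in batch2 for each batch1 cell
    nn21 = nearest(batch2, batch1, k)  # neighbours in batch1 for each batch2 cell
    return [(i, j) for i in range(len(batch1)) for j in sorted(nn12[i]) if i in nn21[j]]

# Made-up coordinates: batch2 repeats the two batch1 populations with a small shift,
# plus one cell (index 2) from a population that batch1 does not contain.
batch1 = [(0.0, 0.0), (5.0, 5.0)]
batch2 = [(0.5, 0.2), (5.4, 5.1), (9.0, 9.0)]

pairs = mnn_pairs(batch1, batch2, k=1)
print(pairs)  # [(0, 0), (1, 1)]
```

The batch2-only cell gets no mutual partner, which is how the approach avoids forcing populations that are unique to one batch onto the other.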