Question

Integrate GEO data and my own scRNAseq data

3.4 years ago

fmazzio1 ▴ 10

Hi all!

I would like to know the best practice for adding GEO data to my own data. Briefly, I am analyzing scRNAseq data from human patients, divided into complete responders and non-responders to treatment. I'd like to add the healthy-donor data available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126030 to see how the patients differ from healthy controls.

My questions are:

1) We ran 5' VDJ scRNAseq in order to perform clonotype analysis on T cells, whereas Szabo PA et al. used 3' scRNAseq. Can I merge the datasets?

2) How would you proceed? Should I analyze them separately, or can I merge them into the same object?

I have adapted their preprocessed .txt matrices to my data. For example, my features are in the format ensembl.symbol (I like keeping both), so my genes look like this: "ENSG00000119812.FAM98A". If I concatenate their genes in the same way, they all have this pattern: "ENSG00000119812.18.FAM98A". By removing the number between the first and second dot I obtain the same format, and if I run:

sum(rownames(sce) %in% rownames(counts))
[1] 8371

where counts is the whole sparse matrix I obtained from their files, with genes as row names and cell barcodes as column names, and

dim(sce)
[1] 8781 31068
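The version-stripping step described above can be sketched in base R as follows. This is only a minimal illustration: the example IDs (including the version numbers) and the helper name `strip_version` are hypothetical, and the regular expression assumes every feature name starts with an `ENSG` accession.

```r
# Hypothetical feature names in the two formats described in the post
own_ids <- c("ENSG00000119812.FAM98A", "ENSG00000141510.TP53")
geo_ids <- c("ENSG00000119812.18.FAM98A",
             "ENSG00000141510.16.TP53",
             "ENSG00000000003.14.TSPAN6")

# Drop the version number sitting between the first and second dot;
# names without a version are returned unchanged
strip_version <- function(ids) {
  sub("^(ENSG[0-9]+)\\.[0-9]+\\.", "\\1.", ids)
}

geo_clean <- strip_version(geo_ids)

# Row names must stay unique after stripping, so check for collisions
n_duplicated <- sum(duplicated(geo_clean))

# Genes present in both datasets
shared <- intersect(own_ids, geo_clean)
```

Checking `n_duplicated` before assigning the cleaned names as row names is worthwhile, because two versioned entries could in principle collapse to the same name.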

I can now proceed by creating two separate Seurat objects, merging them, and running the integration.
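For reference, a rough sketch of what that could look like with the Seurat v3/v4-style anchoring functions is given below. It is an assumption-laden outline, not the poster's actual code: the function name `integrate_two_datasets`, the project labels, and the parameter values (`nfeatures = 2000`, `dims = 1:30`) are illustrative. Note that this pattern keeps the objects in a list for anchoring rather than merging them first.

```r
# Sketch only: requires the Seurat package when the function is called.
# own_counts and geo_counts are assumed to be gene x cell count matrices
# whose row names are already in the same format.
integrate_two_datasets <- function(own_counts, geo_counts) {
  # Restrict both matrices to the genes they share
  shared <- intersect(rownames(own_counts), rownames(geo_counts))

  objs <- list(
    own = Seurat::CreateSeuratObject(own_counts[shared, ], project = "own_5prime"),
    geo = Seurat::CreateSeuratObject(geo_counts[shared, ], project = "GSE126030_3prime")
  )

  # Normalize and pick variable features per dataset
  objs <- lapply(objs, function(x) {
    x <- Seurat::NormalizeData(x, verbose = FALSE)
    Seurat::FindVariableFeatures(x, nfeatures = 2000, verbose = FALSE)
  })

  # Find anchors between the datasets and build the integrated assay
  anchors <- Seurat::FindIntegrationAnchors(object.list = objs, dims = 1:30)
  Seurat::IntegrateData(anchorset = anchors, dims = 1:30)
}
```

It would be called as `integrated <- integrate_two_datasets(sce_counts, counts)`, where `sce_counts` stands for the count matrix extracted from `sce`.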

Is this a proper way to integrate my data with GEO? Should I do additional checks? Or should I analyze them separately?

Thanks

Francesco

R RNA-Seq sequencing • 1.3k views


You are in the unfortunate, yet typical, situation that the biological effect (healthy vs disease) is confounded by data source. There are tools such as Seurat's anchoring framework, Liger, Harmony and fastMNN that can integrate data from different sources, but (to the best of my knowledge) they assume that at least some cell types are shared between the datasets. I am therefore not sure whether this is applicable here.

Even if you manage to integrate the data, the best you could get, I guess, would be a unified clustering landscape, to see where the healthy cells cluster relative to the disease cells. You cannot perform differential analysis, because of the (probably gigantic) batch effect between the datasets, which you cannot regress out since disease/healthy is confounded by study.

What exactly do you want to answer? There are annotation frameworks such as SingleR which can, for every single cell in your data, find the most similar cell type based on a reference dataset. You could cluster the healthy dataset as usual, give the clusters cellular identities, and then use SingleR to see, for each of your single cells or clusters in the disease data, which of those reference clusters is most similar, see http://bioconductor.org/packages/release/bioc/html/SingleR.html
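A minimal sketch of that SingleR idea, assuming the SingleR package from Bioconductor is installed, might look like the following. The function name `annotate_with_reference` and its argument names are hypothetical; both expression matrices are assumed to be log-normalized with genes as rows, and `ref_labels` is assumed to hold one cluster or cell-type label per reference cell.

```r
# Sketch only: requires the SingleR package when the function is called.
annotate_with_reference <- function(test_logcounts, ref_logcounts, ref_labels) {
  # Use only the genes present in both datasets
  shared <- intersect(rownames(test_logcounts), rownames(ref_logcounts))

  # de.method = "wilcox" is the setting suggested for single-cell references
  SingleR::SingleR(
    test = test_logcounts[shared, ],
    ref = ref_logcounts[shared, ],
    labels = ref_labels,
    de.method = "wilcox"
  )
}
```

The returned object has one row per test cell, so `table(pred$labels)` on the result `pred` would give the number of disease cells assigned to each healthy reference cluster.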

That would save you from the difficulty of integration and all these batch effects. Does that make sense here?
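To make the confounding concrete, here is a small base R illustration with made-up sample metadata (two samples per group, purely hypothetical). Because every healthy sample comes from GEO and every patient sample from the in-house study, a design with both study and condition is rank deficient, so a healthy vs patient effect cannot be separated from the study effect.

```r
# Hypothetical sample sheet: condition is fully nested within study
meta <- data.frame(
  condition = factor(
    c("healthy", "healthy", "responder", "responder",
      "non_responder", "non_responder"),
    levels = c("healthy", "responder", "non_responder")
  ),
  study = factor(c("GEO", "GEO", "own", "own", "own", "own"))
)

# Cross-tabulation: healthy samples exist only in GEO
design_table <- table(meta$condition, meta$study)
print(design_table)

# The study column equals the sum of the two patient columns,
# so the design matrix has fewer independent columns than coefficients
mm <- model.matrix(~ study + condition, data = meta)
rank_deficient <- qr(mm)$rank < ncol(mm)
print(rank_deficient)
```

The responder vs non-responder contrast is unaffected, since both groups come from the same study.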

Reply by ATpoint (81k), 3.4 years ago

score 0 · Answer 1 · 2020-12-16

Yep, it makes great sense.

The cell type is the same in both datasets: CD3+ T cells. Regarding my question: with flow cytometry we found that T cells in responders have almost the same cluster landscape as those of healthy donors, whereas the landscape is completely different in non-responders. Since I found differences between responders and non-responders in scRNAseq, I was wondering whether I can confirm the flow cytometry finding with scRNAseq by comparing non-responders, responders and healthy donors.

I'll follow your suggestion and try SingleR.

Thank you!

Francesco