Hi all!
I would like to know the best practice for adding GEO data to my own data. Briefly, I am analyzing scRNA-seq data from human patients, divided into complete responders and non-responders to treatment. I'd like to add the healthy donor data available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126030 to see how the patients differ from healthy controls.
My questions are: 1) We ran 5' VDJ scRNA-seq in order to perform clonotype analysis on T cells, whereas Szabo PA et al. used 3' scRNA-seq. Can I merge the datasets? 2) How would you proceed? Should I analyze them separately, or can I merge them into the same object? I have adapted their preprocessed .txt matrices to my data: my features are in the format ensembl.symbol (I like keeping both). To give you an idea, my genes look like this: "ENSG00000119812.FAM98A". If I concatenate their gene IDs and symbols in the same way, all their genes have this pattern: "ENSG00000119812.18.FAM98A". By removing the number between the first and second dot I obtain the same format, and if I run:
sum(rownames(sce) %in% rownames(counts))
[1] 8371
where counts is the full sparse matrix I built from their files, with genes as rownames and cell barcodes as colnames. For comparison, my own object has:
dim(sce)
[1] 8781 31068
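The renaming step described above (dropping the version number between the first and second dot) can be sketched in base R roughly as follows. This is only an illustration: the helper name `strip_ensembl_version` is hypothetical, and the second ID in `geo_ids` is made up to show that dots inside a gene symbol are left alone.

```r
# Hypothetical helper: drop the Ensembl version number that sits between
# the gene ID and the symbol, e.g. "ENSG00000119812.18.FAM98A".
strip_ensembl_version <- function(ids) {
  # Anchor on the ENSG accession so dots inside the gene symbol are untouched.
  sub("^(ENSG[0-9]+)\\.[0-9]+\\.", "\\1.", ids)
}

geo_ids <- c("ENSG00000119812.18.FAM98A",     # pattern seen in the GEO files
             "ENSG00000000001.5.AC000001.1")  # made-up ID, symbol contains a dot
own_ids <- c("ENSG00000119812.FAM98A")        # ensembl.symbol format

clean_ids <- strip_ensembl_version(geo_ids)

# Stripping versions could in principle create duplicate names, so check first.
stopifnot(anyDuplicated(clean_ids) == 0)

# Genes shared between the two sets of names after harmonising them.
shared <- intersect(own_ids, clean_ids)
print(clean_ids)
print(shared)
```

On the real objects, the same `intersect()` call on `rownames(sce)` and the cleaned `rownames(counts)` would give the shared gene set, which can be used to subset both matrices if a common feature set is wanted.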
I am now able to create two separate Seurat objects, merge them, and run the integration.
Is this a proper way to integrate my data with the GEO data? Should I run additional checks? Or should I analyze the two datasets separately?
Thanks
Francesco
You are in the unfortunate yet typical situation where the biological effect (healthy vs. disease) is confounded with the data source. There are tools such as Seurat's anchoring framework, Liger, Harmony and fastMNN that can integrate data from different sources, but to the best of my knowledge they assume that at least some cell types are shared between the datasets. I am therefore not sure whether they are applicable here. Even if you do manage to integrate the data, the best you could get, I guess, is a unified clustering landscape that shows where the healthy cells cluster relative to the disease cells. You cannot perform differential analysis, because the (probably gigantic) batch effect between the datasets cannot be regressed out when disease/healthy status is confounded with study. What exactly do you want to answer?
There are annotation frameworks such as SingleR which can, for every single cell in your data, find the most similar cell type based on a reference dataset. You could cluster the healthy dataset as usual, assign cellular identities to the clusters, and then use SingleR to see, for each single cell or cluster in the disease data, which of those reference clusters is most similar; see http://bioconductor.org/packages/release/bioc/html/SingleR.html
That would save you from the difficulty of integration and all these batch effects. Does that make sense here?
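To illustrate the confounding point above, here is a minimal base R sketch on a toy sample sheet. The sample sheet and the labels "GEO" and "inhouse" are made up for illustration; the only assumption carried over from the thread is that all healthy samples come from one study and all patients from the other.

```r
# Toy sample sheet (made-up): every healthy sample comes from the GEO study,
# every patient sample from the in-house study.
samples <- data.frame(
  study     = factor(c("GEO", "GEO", "GEO", "inhouse", "inhouse", "inhouse")),
  condition = factor(c("healthy", "healthy", "healthy",
                       "disease", "disease", "disease"))
)

# The cross-tabulation has empty off-diagonal cells: study predicts condition.
print(table(samples$study, samples$condition))

# Design matrix with both terms: intercept + study + condition.
design <- model.matrix(~ study + condition, data = samples)

# The rank is smaller than the number of columns, so the study and condition
# coefficients cannot be estimated separately.
design_rank <- qr(design)$rank
print(c(columns = ncol(design), rank = design_rank))
```

A rank-deficient design like this is what "cannot be regressed out" means in practice: any model that includes both study and condition has no unique solution for the two effects.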