I have 4 samples: two related tissues from each of two different donors. I ran cellranger count on all four samples, and used cellranger aggr to combine all the data.
Then I gave the filtered matrix data from each sample (not the matrix data from the aggregation) to Seurat and had it integrate the data.
The 10x aggr method puts each library in its own cluster. Seurat's integration puts all the cells from all the samples into one big cluster.
I was wondering if anyone had observed this before, or if anyone had an idea as to which UMAP is likely to be more reliable. I think that Seurat's algorithm is more sophisticated, but maybe the 10X people understand their data better, and their way is better for their libraries? Is there a way to change my command lines to make the two ways more similar?
10x Genomics command lines
cellranger count --id=donor1_type1 --fastqs=/projects/Illumina/200310_NB551398_0049_AHCN2KBGXC/mkfastq/outs/fastq_path/HCN2KBGXC/donor1_type1/ --transcriptome=/projects/Illumina/W/10xGenomics/refdata-cellranger-1.1.0/GRCh38_96/GRCh38/ --localcores=30
cellranger aggr --id=all_200319_aggregate --csv=all_200319_aggr.csv
Seurat R commands, taken from here: https://satijalab.org/seurat/v3.1/immune_alignment.html
data <- Read10X(data.dir = data_dir)
pbmc <- CreateSeuratObject(counts = data, project = "donor1_type1", min.cells = 3, min.features = 200)
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
immune.anchors <- FindIntegrationAnchors(object.list = list(donor1_type1, donor1_type2, donor2_type1, donor2_type2), dims = 1:20)
combined.all <- IntegrateData(anchorset = immune.anchors, dims = 1:20)
rm(immune.anchors)
DefaultAssay(combined.all) <- "integrated"
combined.all <- ScaleData(combined.all, verbose = FALSE)
combined.all <- RunPCA(combined.all, npcs = 30, verbose = FALSE)
combined.all <- RunUMAP(combined.all, reduction = "pca", dims = 1:20)
combined.all <- FindNeighbors(combined.all, reduction = "pca", dims = 1:20)
combined.all <- FindClusters(combined.all, resolution = 0.5)
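The per-sample steps above (Read10X through FindVariableFeatures) have to run once for each of the four samples before FindIntegrationAnchors can be called. A minimal sketch of that loop follows. The directory paths and the names `sample_dirs` and `objs` are assumptions for illustration; the paths follow the usual `<id>/outs/filtered_feature_bc_matrix` layout written by cellranger count and should be adjusted to the real output locations.

```r
library(Seurat)

# Hypothetical locations of the per-sample filtered matrices (assumption:
# each cellranger count run used the sample name as its --id).
sample_dirs <- c(
  donor1_type1 = "donor1_type1/outs/filtered_feature_bc_matrix",
  donor1_type2 = "donor1_type2/outs/filtered_feature_bc_matrix",
  donor2_type1 = "donor2_type1/outs/filtered_feature_bc_matrix",
  donor2_type2 = "donor2_type2/outs/filtered_feature_bc_matrix"
)

# Load, filter, normalize, and pick variable features for each sample.
objs <- lapply(names(sample_dirs), function(s) {
  counts <- Read10X(data.dir = sample_dirs[[s]])
  obj <- CreateSeuratObject(counts = counts, project = s,
                            min.cells = 3, min.features = 200)
  obj <- NormalizeData(obj)
  FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000)
})
names(objs) <- names(sample_dirs)

# The named list can then be passed straight to the anchor step.
immune.anchors <- FindIntegrationAnchors(object.list = objs, dims = 1:20)
```

Keeping the objects in a named list avoids retyping (and mistyping) the four object names in the anchor call.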
I expect differences, but I would have hoped that the same tissues would cluster together.
With only 4 samples I'm not sure that batch effect is really relevant. I guess each of the donors would be its own batch. I think these samples were received days apart, but they would definitely have used the same library prep kit.
So when would be the right circumstances to use Seurat merge versus Seurat integrate? Is integrate only for very disparate data sets? Like datasets from different sources?
I use integrate most of the time. It's very rare not to see any batch differences between samples. You have to see what makes sense for your experiment. Look at the data both ways and check the expression of important population markers. In the Seurat pancreas integration vignettes, there are a few plots showing UMAPs with and without integration. They are profiling very diverse populations, so the cell types do cluster together, but you can still see segregation within each population based on the library type.
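To look at the data both ways, something like the following could work. It is only a sketch: it assumes the four per-sample Seurat objects exist under the names used in the question and that `combined.all` is the integrated object from the commands above. The names `objs`, `merged`, and `markers` are made up for illustration, and the marker genes are placeholders to be replaced with markers that matter for your tissues.

```r
library(Seurat)

# Assumption: the four per-sample objects already exist under these names.
objs <- list(donor1_type1 = donor1_type1, donor1_type2 = donor1_type2,
             donor2_type1 = donor2_type1, donor2_type2 = donor2_type2)

# Plain merge: concatenates the cells without any batch correction.
merged <- merge(x = objs[[1]], y = objs[-1],
                add.cell.ids = names(objs), project = "merged_no_integration")
merged <- NormalizeData(merged)
merged <- FindVariableFeatures(merged, selection.method = "vst", nfeatures = 2000)
merged <- ScaleData(merged, verbose = FALSE)
merged <- RunPCA(merged, npcs = 30, verbose = FALSE)
merged <- RunUMAP(merged, reduction = "pca", dims = 1:20)

# Color both UMAPs by sample of origin and compare them side by side.
DimPlot(merged, reduction = "umap", group.by = "orig.ident")
DimPlot(combined.all, reduction = "umap", group.by = "orig.ident")

# Placeholder marker genes (hypothetical choice, not from the thread).
markers <- c("PTPRC", "EPCAM")
FeaturePlot(merged, features = markers)

# View uncorrected expression on the integrated embedding.
DefaultAssay(combined.all) <- "RNA"
FeaturePlot(combined.all, features = markers)
```

If the merged UMAP splits by donor while known markers mark the same populations in each donor, that points to a donor or batch effect that integration is removing. If the integrated UMAP blends populations whose markers clearly differ, the integration may be over-correcting.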