Question

t-SNE plot dots that are close together assigned to different clusters.

8 months ago

leranwangcs ▴ 120

Hi,

I am having trouble understanding why some dots that are very close together in the t-SNE plot are assigned to different clusters by FindNeighbors() and FindClusters().

For example, in the plot below:

[image: t-SNE plot of the cells, colored by cluster]

Most of cluster 0 (red dots) is in the bottom right, but some of its cells are scattered in the upper left and seem closer to cluster 3 (purple) and cluster 2 (blue) than to the rest of cluster 0.

Why were those dots still assigned to cluster 0?

This is a subset of a larger set of cells. The steps I took are:

# read sample1
sample1 <- Read10X("~/sample1/raw_feature_bc_matrix/")
sample1 <- CreateSeuratObject(counts = sample1, project = "sample1", min.cells = 3, min.features = 200)
sample1@meta.data$cellID <- names(sample1$orig.ident)
# subset cells to only keep the cells that we want
sample1 <- subset(sample1, cellID %in% sample1_ID_list)
sample1 <- NormalizeData(sample1)
sample1 <- FindVariableFeatures(sample1, selection.method = "vst", nfeatures = 2000)
sample1 <- ScaleData(sample1, verbose = FALSE)
sample1 <- RunPCA(sample1, npcs = 30, verbose = FALSE)


# read sample2
sample2 <- Read10X("~/sample2/raw_feature_bc_matrix/")
sample2 <- CreateSeuratObject(counts = sample2, project = "sample2", min.cells = 3, min.features = 200)
sample2@meta.data$cellID <- names(sample2$orig.ident)
# subset cells to only keep the cells that we want
sample2 <- subset(sample2, cellID %in% sample2_ID_list)
sample2 <- NormalizeData(sample2)
sample2 <- FindVariableFeatures(sample2, selection.method = "vst", nfeatures = 2000)
sample2 <- ScaleData(sample2, verbose = FALSE)
sample2 <- RunPCA(sample2, npcs = 30, verbose = FALSE)


# read sample3
sample3 <- Read10X("~/sample3/raw_feature_bc_matrix/")
sample3 <- CreateSeuratObject(counts = sample3, project = "sample3", min.cells = 3, min.features = 200)
sample3@meta.data$cellID <- names(sample3$orig.ident)
# subset cells to only keep the cells that we want
sample3 <- subset(sample3, cellID %in% sample3_ID_list)
sample3 <- NormalizeData(sample3)
sample3 <- FindVariableFeatures(sample3, selection.method = "vst", nfeatures = 2000)
sample3 <- ScaleData(sample3, verbose = FALSE)
sample3 <- RunPCA(sample3, npcs = 30, verbose = FALSE)


# QC step (skipped)

# integrate 3 samples
immune.anchors <- FindIntegrationAnchors(object.list = list(sample1, sample2, sample3), dims = 1:20)

combined.3.samples <- IntegrateData(anchorset = immune.anchors, dims = 1:20)

DefaultAssay(combined.3.samples) <- "RNA"
combined.3.samples <- NormalizeData(combined.3.samples, normalization.method = "LogNormalize", scale.factor = 10000)
combined.3.samples <- FindVariableFeatures(combined.3.samples, selection.method = "vst", nfeatures = 2000)
combined.3.samples <- ScaleData(combined.3.samples, verbose = FALSE)
combined.3.samples <- RunPCA(combined.3.samples, npcs = 30, verbose = FALSE)
combined.3.samples <- RunUMAP(combined.3.samples, dims = 1:20,reduction = "pca")
combined.3.samples <- RunTSNE(object = combined.3.samples,reduction = "pca")
combined.3.samples <- FindNeighbors(combined.3.samples, reduction = "pca", dims = 1:20)
combined.3.samples <- FindClusters(combined.3.samples, resolution = 0.5)

# plot
DimPlot(combined.3.samples, reduction = "tsne",group.by = "seurat_clusters",label = TRUE,repel = TRUE)

Thank you!
Leran

scRNA clustering • 1.0k views

updated 8 months ago by LauferVA 4.2k • written 8 months ago by leranwangcs ▴ 120

Could you please tell us what steps you took before making the t-SNE plot, and what your sample is (PBMCs or another tissue)? Could you also tell us whether this is a subset of clusters taken from a larger set of clusters? Sample heterogeneity could be one of the factors driving scattered clusters like the ones you show here.

8 months ago by bk11 ★ 2.4k

Thanks for the suggestion! I have edited my post!

Leran

8 months ago by leranwangcs ▴ 120

Your post has sample1 being read in the sample2 code chunk as well. Is that a copy-paste typo or does your code have that error too?

8 months ago by Ram 43k

Thanks for pointing that out! I have corrected it.

Leran

8 months ago by leranwangcs ▴ 120

score 3 · Answer 1 · 2023-08-10

Hey leranwangcs, one goal of t-SNE is to reduce data with many dimensions to a 2D plane for the purposes of visualization.

This concept is not entirely new in the development of thought. Consider, for instance, the plight of cartographers in the 16th century. They wanted to represent the 3D surface of the earth on a 2D object like a sheet of paper.

To do that, various projections were devised, each with its relative strengths and weaknesses. Consider, for instance, the well-known Mercator projection. It does a pretty good job in most respects, right? But wait a minute, what about Alaska and Russia? On the map, they appear to be on opposite sides of the earth, but in reality, at the nearest point, Alaska is only about 55 miles (89 km) from Russia.

Let's cut to the chase: any projection of a high-dimensional data space onto a lower-dimensional surface will produce some amount of deformation; indeed, this is well known with t-SNE (see, for instance, this tutorial).
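Here is a toy illustration of that deformation, in base R on simulated points (not the poster's data). It is only a sketch: the "projection" simply drops one coordinate, which is far cruder than t-SNE, and all variable names are made up for the example. The point is that two groups that are well separated in 3D can land on top of each other in 2D.

```r
# Two simulated groups that differ only along the z axis.
set.seed(1)
n <- 50
group_a <- cbind(x = rnorm(n), y = rnorm(n), z = rnorm(n, mean = 0))
group_b <- cbind(x = rnorm(n), y = rnorm(n), z = rnorm(n, mean = 20))
pts3d  <- rbind(group_a, group_b)
labels <- rep(c("a", "b"), each = n)

# Crude linear projection to 2D: keep x and y, discard z.
# (An assumption for illustration; t-SNE is nonlinear.)
pts2d <- pts3d[, c("x", "y")]

# Mean Euclidean distance between points of group a and points of group b.
mean_between <- function(m, labels) {
  d <- as.matrix(dist(m))
  mean(d[labels == "a", labels == "b"])
}

between_3d <- mean_between(pts3d, labels)  # groups are far apart here
between_2d <- mean_between(pts2d, labels)  # groups overlap here

cat("mean between-group distance in 3D:", between_3d, "\n")
cat("mean between-group distance in 2D:", between_2d, "\n")
```

Any clustering done on the 3D coordinates would separate the two groups cleanly, yet a plot of the 2D coordinates would show them intermingled.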

While having dots from different groups that overlap can be indicative of problems, it may not be all that surprising depending on the dataset... At any rate, the steps you take to assess whether or not those differences are meaningful or arbitrary, THAT is the real trick of it.

Now, with that as background, there are a couple of practical suggestions that LChart mentioned elsewhere. Specifically, in this line:

combined.3.samples <- RunTSNE(object = combined.3.samples,reduction = "pca")

Note that you have defined the clusters in PC space, but are now plotting them in t-SNE space. This is relatively more likely to produce overlap between observations belonging to different clusters. Another likely culprit is the use of random initialization for t-SNE. One could also consider defining distinct starting positions for each cluster, which would likely decrease or eliminate the problem. LChart, please emend or correct, pro re nata.
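One way to check whether the scattered dots are a plotting artifact is to compare each cell's nearest neighbours in the space used for clustering with its nearest neighbours in the 2D embedding. The following is a minimal base R sketch on simulated stand-in data; `knn_same_cluster` is a hypothetical helper written for this example, not a Seurat function.

```r
# For each point, the fraction of its k nearest neighbours (Euclidean)
# that carry the same cluster label as the point itself.
knn_same_cluster <- function(coords, clusters, k = 10) {
  d <- as.matrix(dist(coords))
  diag(d) <- Inf  # a point is not its own neighbour
  vapply(seq_len(nrow(d)), function(i) {
    nn <- order(d[i, ])[seq_len(k)]
    mean(clusters[nn] == clusters[i])
  }, numeric(1))
}

# Simulated stand-in data (NOT the poster's cells): two clusters that are
# separated only along dimensions 3 to 10 of a 10-dimensional space.
set.seed(42)
n <- 100
high_dim <- matrix(rnorm(2 * n * 10), ncol = 10)
high_dim[(n + 1):(2 * n), 3:10] <- high_dim[(n + 1):(2 * n), 3:10] + 6
clusters <- rep(c("0", "1"), each = n)

# A deliberately lossy 2D view that keeps only the first two dimensions.
low_dim <- high_dim[, 1:2]

purity_high <- mean(knn_same_cluster(high_dim, clusters))
purity_low  <- mean(knn_same_cluster(low_dim, clusters))

cat("same-cluster neighbour fraction, high-dim:", purity_high, "\n")
cat("same-cluster neighbour fraction, 2D view: ", purity_low, "\n")
```

With a real Seurat object, one could presumably pass `Embeddings(combined.3.samples, reduction = "pca")[, 1:20]` and `Embeddings(combined.3.samples, reduction = "tsne")` as `coords`, with `combined.3.samples$seurat_clusters` as `clusters`; that usage is an assumption and is untested here, and `dist()` builds a full distance matrix, so it suits a subset of cells rather than a large dataset. Cells whose neighbours agree with their cluster in PC space but not in the t-SNE view are likely being misplaced by the embedding rather than misclustered. It is also worth checking that `RunTSNE` is given the same `dims` as `FindNeighbors`: the call above does not set `dims`, so the embedding may be built from a different set of PCs than the 1:20 used for clustering.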

score 0 · Answer 2 · 2023-08-10

Since you started from raw_feature_bc_matrix, you must remove empty droplets and ambient RNA contamination before downstream analysis, and also perform standard QC. See these links:

http://bioconductor.org/books/3.14/OSCA.advanced/droplet-processing.html

https://satijalab.org/seurat/articles/pbmc3k_tutorial.html

For the data integration part, you can follow the Seurat v5 approach described in the link below and check which integration method works best for your data.

https://satijalab.org/seurat/articles/seurat5_integration