Entering edit mode
21 months ago
tujuchuanli
▴
130
Hi, I wanted to analyze the scRNA-seq data on GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE190772). It is the data for breast cancer bone metastases. The data is log transformed counts. I did it as following. There is no err occurred during processing. The Umap looks OK and I can see the separated cell populations on the Umap plot. However, the cluster annotation is weird. The clusters was mixed. Every separated cell population on the Umap plot was comprised cells from almost all clusters. Do you know how to figure it out? Thanks
Belowing is my code:
data = read.table(file = "GSE190772_BoM_logCounts.txt", sep = "\t", header = T)
BoM_SO = CreateSeuratObject(counts = data, min.cells = 3, min.features = 300, project = "BoM")
BoM_SO.filter = subset(BoM_SO, subset = nFeature_RNA > 400 & nFeature_RNA < 8000 & percent.ribo < 0.2 & percent.mito < 0.2)
BoM_SO.filter@assays$RNA@data = as.matrix(data)
BoM_SO.filter@misc[["seurat_data"]] = as.matrix(BoM_SO.filter@assays$RNA@data)
BoM_SO.filter = FindVariableFeatures(object = BoM_SO.filter, mean.function = ExpMean, dispersion.function = LogVMR, dispersion.cutoff = c(0.05, 5), nfeatures = 3000)
BoM_SO.filter = ScaleData(object = BoM_SO.filter, do.scale = T,do.center = T)
BoM_SO.filter = RunPCA(BoM_SO.filter)
BoM_SO.filter = FindNeighbors(BoM_SO.filter, dims = 1:50)
BoM_SO.filter = FindClusters(BoM_SO.filter, resolution = 1)
BoM_SO.filter = RunUMAP(BoM_SO.filter, dims = 1:50, do.fast = TRUE)
DimPlot(BoM_SO.filter, label = T, sizes.highlight = 5, pt.size = 1)
can you show us the plot?
Hi jv,
Bellowing is my Umap plot. You can see the well separated cell population and mixed cluster annotations.
it is clear that
DimPlot
is not coloring the points be cluster identity. This is all the more clear byNA
in the legend. I don't know what default metadata vectorDimPlot
uses, but you can specify which metadata contains the clusters by setting thegroup.by
parameter. As a first pass, trygroup.by = 'seurat_clusters'
. Else, examine your meta data before runningDimPlot
to see which vector contains the cluster ids.Hi jv,
Thank you very much to provide precise suggestion. I checked the metadata table (BoM_SO.filter@meta.data) and found that many “NA”s were assigned to the cells at the end of the Seurat_clusters (BoM_SO.filter@meta.data$Seurat_clusters). I don`t know what happen to the Seurat_clusters. However the cluster ids in BoM_SO.filter@meta.data$RNA_snn_res.1 seemed OK and group by RNA_snn_res.1 in Umap was also normal.
I compared the cluster id in Seurat_clusters and RNA_snn_res.1 and found that it seemed that some cluster ids in Seurat_clusters were omitted then the resting cluster ids were directly appended to the omitted position. For example the cluster id in RNA_snn_res.1 is “1, 2, 3, 4, 5, 6, 7” for cell A, B, C, D, E, F and G. However, in Seurat_clusters the order is “1, 2, 4, 5, 6, 7, NA” for cell A, B, C, D, E, F and G.
I am not sure that I used RNA_snn_res.1 instead of Seurat_clusters in Umap is right and what happen to Seurat_clusters. Do you know how can I make sure of it?
Hi, there are a few things that I don't understand from your code.
You are working with log-transformed counts, which I assume are also normalized (this is a big assumption). So, again in theory and based on assumptions, they should have been filtered by the authors.
You input it in a
Seurat
object as counts and then you filter the object. Then you reassign the original log count table - that is, before filtering - to the object'sdata
slot. You also assigned to another slot inmisc
called "seurat_data". Why? It doesn't sound likeSeurat
should even allow you to do that (in theory you are supplying a count table with a smaller number of cells than the rest of the object references).Moreover by passing it first to
as.matrix()
you are causing two issues: 1) you are replacing an otherwise sparse matrix with a dense one, making the object balloon in size for apparently no reason. 2) issue 1 is not a big problem in itself, but in this specific case this table has the gene symbol (a character vector) at the end of the table. So by transforming the data into a matrix, you are effectively coercing all your numerics to become characters ("0.00000" instead of 0). A character matrix is not going to get you anywhere.I downloaded the file from GEO and applied the same workflow and it threw an error. When omitting the problematic lines (the ones where you replace the
data
and themisc
slots) it runs well with no errors (see UMAP). However, you still miss the gene symbols by doing this.You should do first:
Only then you build your
Seurat
object:and you can go on with filtering, variable features, scaling and PCA, etc and can even plot genes now that we have them in the row names.
Hi Giuseppe,
Thank you very much to provide precise suggestions. It teached me a lot. I tried your method, and it came out to be fine and got the Umap which was very similar with your plot.
Since some part of my codes referred to the online information. I didn`t sure whether my understanding or what they said was right. Bellowing is my reference:
Reference link:
Belowing is some my own understanding to your questions and a few further questions:
You are working with log-transformed counts, which I assume are also normalized (this is a big assumption). So, again in theory and based on assumptions, they should have been filtered by the authors.
--Yes, I agreed with you that the cells were filtered by the author. It is not necessary to filter it again. I just applied a very loose criterion to filter it ~~.
You input it in a Seurat object as counts and then you filter the object. Then you reassign the original log count table - that is, before filtering - to the object's data slot.
--since the input matrix is normalized and we didn`t need to normalize it again. We need to overwrite the normalized and log-transformed matrix into BoM_SO.filter@assays$RNA@data which stored the normalized and log-transformed matrix after running “NormalizeData” in seurat object. You can see reference link 1# and 4# for the further information.
You also assigned to another slot in misc called "seurat_data". Why?
--To be honest, I didn`t know the exactly reason. My understanding is that it is optional. You can see reference link 4# for the further information.
and you can go on with filtering, variable features, scaling and PCA, etc and can even plot genes now that we have them in the row names.
--I am not sure that my understanding is correct. Do you mean that I didn`t need to normalize the data by “NormalizeData” and also not necessary to overwrite the normalized and log-transformed counts into BoM_SO.filter@assays$RNA@data by "BoM_SO.filter@assays$RNA@data = as.matrix(data)"? If so, how can the Seurat know where was the normalized and log-transformed counts?
Hi, here some answers.
This is alright, but in your code you filter the
Seurat
object first, and then you assign the unfiltered (or pre-filtered) normalized count matrix that you downloaded from GEO to thedata
slot. This means that there may be some inconsistencies between the number of cells and/or genes in the object (weirdly enough thoughSeurat
does not throw an error), and what is in the normalized counts you supply. For this reason I don't think that operation makes sense. Since you know that the filtering was already done, you can create an object without any filtering and assign the matrix you downloaded to thedata
slot.In the context of that GitHub issue reply, the user is suggesting that you store the
Seurat
-normalized data in themisc
slot before you add your own normalization to thedata
slot. This means that if you want to e.g. compare the result ofNormalizeData()
with another normalization method, you can keep both of them in the same object. This does not apply to your case for two reasons: 1) you do not have the result ofNormalizeData()
because you did not supply raw counts to the object; you only have log-transformed, probably normalized counts, nothing to compare it to. 2) in your code you are first assigning the log counts to thedata
slot, and then copying them to themisc
slot. So, even if you had the "original" result ofNormalizeData()
in thedata
slot, you would first be overwriting it with the external data and then copying it. This is why it does not make sense.exactly; if you supply log-counts to the
data
slot of theSeurat
object, you don't need to normalize them. Seurat uses the data in thedata
slot for HVG selection and scaling, so it already "knows" where to find it.An answer from your #4 link mentions an important thing to take into account: the base of the logarithm used for log transformation matters. If the log counts you download from GEO are in natural log, then it's ok; otherwise you would need to re-transform them.
Hi Giuseppe,
Thank you for your replying. Just one more question to make sure
--Do you mean that I still need to overwrite the BoM_SO.filter@assays$RNA@data by BoM_SO.filter@assays$RNA@data = as.matrix(data)? I understood what you said above was that I need to supply the log-counts to the data slot of the Seurat object. If so, the result of my Umap plot was still abnormal just like before.
If not, how do the Seurat know what I supply to it is log-counts when constructing Seurat object? And if skipping the normalization step, what is the data in data slot? Do the Seurat pass the data in counts slot to the data slot if skipping the normalization step?
I tested it and found that the input counts matrix will be writed into both counts and data slot regardless of the format of the input counts matrix (raw counts or log-counts) when constructing seurat object. The normalization step will normalized the matrix in data slot and overwrite it by normalized and log transformed matrix when executing “NormalizeData”. It seemed ok when supply log-counts to construct Seurat object and do not execute “NormalizeData”.
However, the final Umap plot was still abnormal when executing “BoM_SO.filter@assays$RNA@data = data” where data was a sparse matrix. As far as I know, it should be the same as without it. Do you know the reason?
Hi, like I mentioned before, you should not include the
NormalizeData()
step because you are already supplying normalized counts.Regarding the final UMAP plot, did you include the step where you remove the character column (I assume you did because you are transforming it to a sparse matrix)? Anyway, if you follow the code in my initial answer you should be alright; messing too much with the inputs and slots of an object is anyway never a good idea, since you would be changing the inputs that functions expect and will generate unpredictable or implausible results.