Confusion about FindMarkers(), FindVariableFeatures(), RunTSNE(), and RunUMAP() in seurat package
1
2
Entering edit mode
18 months ago
Farah ▴ 70

Hello,

I am using Seurat package for the analysis of my single-cell RNA-seq. I read their basic PBMCs tutorial https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html. However, I get confused about some steps. I would highly appreciate having your answers on my following questions:

1) Is FindVariableFeatures() working on Normalized-Data or on Scaled_Data?

As in the seurat tutorial it is after NormalizeData() and before ScaleData(), but in kallisto | bustools tutorial (file://filestore.soton.ac.uk/users/fsgh1d18/mydesktop/Kallisto_bustools/BUS_notebooks_R-master/docs/10xv2.html) it is after both NormalizeData() and ScaleData().

2) Is FindMarkers() working on Normalized_Data or Scaled_Data or clusters obtained by RunTSNE()/RunUMAP()?

As FindMarkers() finds marker genes of the clusters, so I assume that the input data for FindMarkers() should be clusters obtained by RunTSNE()/RunUMAP() which these functions are also using scaled_data or selected_PCAs (obtained from RunPCA function), see below. Therefore, in practice, FindMarkers() is also using Scaled_Data, while it should work on Normalized_Data.

RunPCA() on Scaled_Data --> Selection of PCAs --> RunTSNE()/RunUMAP() --> FindMarkers()

3) If FindMarkers() works on Normalized_Data (rather than Scaled_Data), how it can be explained that FindMarkers() has inputs of clusters obtained by RunTSNE()/RunUMAP() which these functions are also using scaled_data (not Normalized_Data)?

Many thanks for any help and clarifications.

Seurat scRNA-seq kallisto|bustools • 4.1k views
9
Entering edit mode
18 months ago

You appear to have a few misconceptions about what Seurat is doing for each command. I will do my best to address them.

1. Neither, it's just using the counts. This step is just identifying the most variables genes to help limit the non-sparse matrix returned by ScaleData, as it is very memory intensive and will bloat the object greatly if all genes are used. In addition, the top few thousand most variable genes have been shown to be plenty sufficient for marker identification.

2. RunTSNE and RunUMAP do not perform clustering, that is performed by FindClusters. FindMarkers uses the data slot by default (normalized counts) - it would make no sense to use the scaled.data slot, as those are basically just z-scores (or residuals if using SCTransform or one of the integration methods). Depending on the test used, it may make more sense to use the counts slot at times.

3. Again, these methods are not performing clustering. Clustering is performed by FindClusters after constructing a shared nearest neighbor graph on the output of RunPCA via FindNeighbors, which uses the PCA embeddings to determine similarities between cells. The clustering function then groups cells based on these similarities into clusters with an adjustable resolution that defines how granular the distinctions between clusters should be.

I highly recommend reading the manual for the commands in question to better understand what they're doing.

0
Entering edit mode

Many thanks for your helpful explanations. I got the points. Thanks a lot.

0
Entering edit mode

0
Entering edit mode

Hi Jared.

Just one more about the second point of your answer.

it would make no sense to use the scaled.data slot, as those are basically just z-scores (or residuals if using SCTransform or one of the integration methods). Depending on the test used, it may make more sense to use the counts slot at times.

What I understand was SCTransform will only calculate residuals and we can access it by GetAssayData(obj, slot="scale.data"). If I understand you correctly, the value of GetAssayData(obj, slot="data") is also calculated by SCTransform and such value is done by NormalizeData() in old Seurat. So is SCTransform's GetAssayData(obj, slot="data") == NormalizeData(obj)?

Since Seurat is under development continuously and there is always an 'A-ha' monent.

0
Entering edit mode

I do not know. I expect they are similar, but I don't know enough about the internals of the SCTranscform function to say that they should be identical. Easy enough to test yourself on a toy dataset though.

0
Entering edit mode

Just double-checked.

RNA

• counts: Stores unnormalized data such as raw counts or TPMs
• data: Normalized data matrix
• scale.data: Scaled data matrix

SCT

• counts: corrected counts
• data: log1p(counts)
• scale.data: pearson residuals

So SCTransform's GetAssayData(obj, slot="data") is not equal to RNA's NormalizeData(obj).

But another big issue I can not make clear is when we should which assay or slot even I spent more than 3 hours on this. Especially for FindMarkers.

0
Entering edit mode

I am not sure if this will help but this is my current understanding. I have recently started learning about the Single Cell analysis and here it is.

1. If you follow the general integration method of Seurat given here. For Findmarkers function use RNA assay: reason1 reason2

2. For SCTtransformed data, you can again use RNA assay or SCT normalized values as recommended in this thread

0
Entering edit mode

They contradict point 2 in numerous other issues. It has been nearly 2 years since they introduced that function and it's still not clear how/if SCT normalized values are truly appropriate for DE in all cases or not.

The answer to "can I use the SCT assay with typical DE function available in Seurat?" from the Satija lab seems to be something along the lines of "Maybe. Probably. Sometimes. Sometimes probably not."

On the other hand, the sctransform author (the package, not the Seurat SCTransform function) has provided a way to do DE on sctransform'ed matrices (including Seurat SCT assays), so who knows at this point. See the vignette.