Can I perform scRNAseq data integration without loosing features?
2
0
Entering edit mode
7 days ago
joker33 ▴ 60

Hi,

in the vignette for scRNAseq data integration using Seurat (https://satijalab.org/seurat/articles/integration_introduction.html), the most variable features, including 2000 or 3000 features are chosen for data integration of multiple samples, resulting in a combined data object.

The combined data object has only the 2000 or 3000 most variable features, which were chosen at the beginning of the workflow. All the other features are lost in the integrated data. Is there any option to integrate the data having all features in the integrated data object?

Thanks!

scRNAseq anchor integration Seurat features • 216 views
ADD COMMENT
2
Entering edit mode
7 days ago
igor 12k

You can use as few or as many features as you want. Keep in mind that the integration happens based on the reduced dimensions, so you are "losing" features either way.

ADD COMMENT
0
Entering edit mode

Thanks for your response. If I use SCT transformation, it seems like I cannot specify all features, as there is an error coming up during the workflow (see below). I think might be a transformation-specific issue. Any advice?

> data.list <- PrepSCTIntegration(object.list = data.list, anchor.features = rownames(data.list[[1]]))
  |                                                  | 0 % ~calculating  Error in scale.data[anchor.features, ] : subscript out of bounds
In addition: Warning message:
The following requested features are not present in any models: AADAC, AADAT, AASS, ABCA10, ABCA12, ABCA6, ABCA8, ABCA9, ABCB4, ABCB5, ABCC11, ABCC6, ABCC8, ABCC9, ABCG2, ABHD1, ABI3BP, ABLIM2, ABLIM3, ABO, AC004069.2, AC004381.6, AC004540.5, AC004791.2, AC005224.2, AC005301.9, AC006116.20, AC007161.5, AC007405.6, AC007880.1, AC007950.2, AC008268.1, AC008746.12, AC009961.3, AC022007.5, AC061992.2, AC068282.3, AC073115.7, AC074183.4, AC083843.1, AC093609.1, AC093642.3, AC093901.1, AC097713.4, AC104024.1, AC104653.1, AC116035.1, AC144831.3, AC241585.1, ACAN, ACKR1, ACKR2, ACKR4, ACOX2, ACRC, ACSM3, ACSM5, ACSS3, ACTR3C, ACVR1C, ACY1, ACY3, ADAM22, ADAM23, ADAM33, ADAMTS1, ADAMTS12, ADAMTS13, ADAMTS15, ADAMTS18, ADAMTS2, ADAMTS3, ADAMTS4, ADAMTS5, ADAMTS7, ADAMTS8, ADAMTS9, ADAMTSL1, ADAMTSL2, ADAMTSL3, ADCY1, ADCY2, ADCY5, ADCY6, ADCYAP1, ADCYAP1R1, ADD2, ADGRA2, ADGRA3, ADGRB1, ADGRB2, ADGRD1, ADGRF5, ADGRG6, ADGRL2, ADGRL3, ADGRL4, ADGRV1, ADH1B, ADH1C, ADIRF, ADM2, ADM5, ADORA1, ADPRH [... truncated]

Here is my code:

> data.list <- SplitObject(seurat, split.by = "patient")
> for (i in 1:length(data.list)){ data.list[[i]] <- SCTransform(data.list[[i]], method = "glmGamPoi") }
> data.list <- PrepSCTIntegration(object.list = data.list, anchor.features = rownames(seurat))
>
> immune.anchors <- FindIntegrationAnchors(object.list = data.list, anchor.features = rownames(seurat), normalization.method = "SCT")
>
> immune.combined <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT")
>  
> DefaultAssay(immune.combined) <- "integrated"

The error occurs after line 3

ADD REPLY
2
Entering edit mode
7 days ago
ATpoint 52k

I think you are missing the concept here. As igor says the integration runs on the reduced dimensions, most commonly PCA, not on a per-gene basis. And for this you only want to include the variable genes as all other genes do not any information. The integrated values are only useful for clustering and visualization which, again, is commonly performed on a reduced dimension level. For something like differential gene expression one would not use the integrated values as the integration precedure creates dependencies between the data points and might change magnitude and directions. There are many threads on this already, e.g. issues at Seurat Github or at section 13.8 in OSCA:

At this point, it is also tempting to use the corrected expression values for gene-based analyses like DE-based marker gene detection. This is not generally recommended as an arbitrary correction algorithm is not obliged to preserve the magnitude (or even direction) of differences in per-gene expression when attempting to align multiple batches. For example, cosine normalization in fastMNN() shrinks the magnitude of the expression values so that the computed log-fold changes have no obvious interpretation. Of greater concern is the possibility that the correction introduces artificial agreement across batches. (...)

In summary, for integration you should use the most variable genes which is the reason why this is the default in most workflows.

ADD COMMENT
0
Entering edit mode

Thank you very much for clarification and for providing some links for further reading. Your explanation is very helpful. As you indicated, the Seurat authors recommend using the uncorrected data and define blocking for the batch variable for differential expression analysis in the discussion section.

I was wondering, however, how you would approach any other analyses apart from DE-testing where you can not define blocking. What if you want to compare ssGSEA /GSEA signature enrichment scores, or if you want to use scRNAseq data to model the expression of a feature? Do we need to wait for better computational methods to allow proper batch correction of scRNAseq data? I've read that methods such as combat and other methods are not providing satisfactory results even for DE-analysis (https://www.sciencedirect.com/science/article/pii/S200103701930409X).

ADD REPLY

Login before adding your answer.

Traffic: 1596 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6