Question

Can I perform scRNAseq data integration without loosing features?

0

Entering edit mode

2.8 years ago

joker33 ▴ 110

Hi,

in the vignette for scRNAseq data integration using Seurat (https://satijalab.org/seurat/articles/integration_introduction.html), the most variable features, including 2000 or 3000 features are chosen for data integration of multiple samples, resulting in a combined data object.

The combined data object has only the 2000 or 3000 most variable features, which were chosen at the beginning of the workflow. All the other features are lost in the integrated data. Is there any option to integrate the data having all features in the integrated data object?

Thanks!

scRNAseq anchor integration Seurat features • 3.8k views

ADD COMMENT • link 2.8 years ago by joker33 ▴ 110

score 7 · Accepted Answer · 2021-07-21

I think you are missing the concept here. As igor says the integration runs on the reduced dimensions, most commonly PCA, not on a per-gene basis. And for this you only want to include the variable genes as all other genes do not any information. The integrated values are only useful for clustering and visualization which, again, is commonly performed on a reduced dimension level. For something like differential gene expression one would not use the integrated values as the integration precedure creates dependencies between the data points and might change magnitude and directions. There are many threads on this already, e.g. issues at Seurat Github or at section 13.8 in OSCA:

At this point, it is also tempting to use the corrected expression values for gene-based analyses like DE-based marker gene detection. This is not generally recommended as an arbitrary correction algorithm is not obliged to preserve the magnitude (or even direction) of differences in per-gene expression when attempting to align multiple batches. For example, cosine normalization in fastMNN() shrinks the magnitude of the expression values so that the computed log-fold changes have no obvious interpretation. Of greater concern is the possibility that the correction introduces artificial agreement across batches. (...)

In summary, for integration you should use the most variable genes which is the reason why this is the default in most workflows.

Ram · Accepted Answer · 2021-07-21

2

Entering edit mode

2.8 years ago

igor 13k

You can use as few or as many features as you want. Keep in mind that the integration happens based on the reduced dimensions, so you are "losing" features either way.

ADD COMMENT • link 2.8 years ago by igor 13k

0

Entering edit mode

Thanks for your response. If I use SCT transformation, it seems like I cannot specify all features, as there is an error coming up during the workflow (see below). I think might be a transformation-specific issue. Any advice?

> data.list <- PrepSCTIntegration(object.list = data.list, anchor.features = rownames(data.list[[1]]))
  |                                                  | 0 % ~calculating  Error in scale.data[anchor.features, ] : subscript out of bounds
In addition: Warning message:
The following requested features are not present in any models: AADAC, AADAT, AASS, ABCA10, ABCA12, ABCA6, ABCA8, ABCA9, ABCB4, ABCB5, ABCC11, ABCC6, ABCC8, ABCC9, ABCG2, ABHD1, ABI3BP, ABLIM2, ABLIM3, ABO, AC004069.2, AC004381.6, AC004540.5, AC004791.2, AC005224.2, AC005301.9, AC006116.20, AC007161.5, AC007405.6, AC007880.1, AC007950.2, AC008268.1, AC008746.12, AC009961.3, AC022007.5, AC061992.2, AC068282.3, AC073115.7, AC074183.4, AC083843.1, AC093609.1, AC093642.3, AC093901.1, AC097713.4, AC104024.1, AC104653.1, AC116035.1, AC144831.3, AC241585.1, ACAN, ACKR1, ACKR2, ACKR4, ACOX2, ACRC, ACSM3, ACSM5, ACSS3, ACTR3C, ACVR1C, ACY1, ACY3, ADAM22, ADAM23, ADAM33, ADAMTS1, ADAMTS12, ADAMTS13, ADAMTS15, ADAMTS18, ADAMTS2, ADAMTS3, ADAMTS4, ADAMTS5, ADAMTS7, ADAMTS8, ADAMTS9, ADAMTSL1, ADAMTSL2, ADAMTSL3, ADCY1, ADCY2, ADCY5, ADCY6, ADCYAP1, ADCYAP1R1, ADD2, ADGRA2, ADGRA3, ADGRB1, ADGRB2, ADGRD1, ADGRF5, ADGRG6, ADGRL2, ADGRL3, ADGRL4, ADGRV1, ADH1B, ADH1C, ADIRF, ADM2, ADM5, ADORA1, ADPRH [... truncated]

Here is my code:

> data.list <- SplitObject(seurat, split.by = "patient")
> for (i in 1:length(data.list)){ data.list[[i]] <- SCTransform(data.list[[i]], method = "glmGamPoi") }
> data.list <- PrepSCTIntegration(object.list = data.list, anchor.features = rownames(seurat))
>
> immune.anchors <- FindIntegrationAnchors(object.list = data.list, anchor.features = rownames(seurat), normalization.method = "SCT")
>
> immune.combined <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT")
>  
> DefaultAssay(immune.combined) <- "integrated"

The error occurs after line 3

ADD REPLY • link updated 2.8 years ago by Ram 43k • written 2.8 years ago by joker33 ▴ 110