Question: Fixing batch effect with seurat integrate
gravatar for rmf
12 months ago by
rmf940 wrote:

I have three 10X runs. If I merge them with CreateSeuratObject(), this is the distribution of genes detected for each run (batch) (nFeatures_RNA).

enter image description here

If I integrate them, using hvg approach or sctransform approach. (Showing sctransform based integration below)

su.list <- SplitObject(su,"orig.ident")
for (i in 1:length(su.list)) {
    su.list[[i]] <- SCTransform(su.list[[i]],return.only.var.genes=F,"mito_percent","S.Score","G2M.Score"))
su.features <- SelectIntegrationFeatures(su.list,nfeatures=3000)
su.list <- Seurat::PrepSCTIntegration(su.list,anchor.features=su.features)
su.anchors <- FindIntegrationAnchors(object.list=su.list,normalization.method="SCT",anchor.features=su.features)
su <- IntegrateData(anchorset=su.anchors,normalization.method="SCT")

I get this gene distribution for each run. Number of genes have now been recalculated based on the 'integrated' assay containing 3000 genes.

enter image description here

Now, is this a batch effect? How would I correct for this? Should I run sctransform() or NormalizeData() again using '' on 'integrated' assay? Similarily, how would I corrected for mito_percent, cell cycle etc in an integration workflow? Should I regress them before or after integration?


scrna.seq seurat single-cell • 2.1k views
ADD COMMENTlink modified 12 months ago • written 12 months ago by rmf940

Something is wrong. nFeatures_RNA does not get modified by either sctransform() or NormalizeData(). Additionally, notice that the number of genes drops for each of your batches. You are doing something else besides integration between the first and second plot. You should check that the size of your Seurat object is the same before and after. You should also post the exact commands you used.

ADD REPLYlink modified 12 months ago • written 12 months ago by igor11k

I also had the feeling something is wrong. Then, I have looked into it and I don't see anything obviously wrong. So, in the first plot, I use nFeatures_RNA, In the second plot, I manually recalculate num of genes detected from 'integrated' assay using colSums(GetAssayData(su,assay="integrated")>0). 'nFeatures_SCT' also does not have this change in distribution since RNA assay and SCT assay both have all the initial set of genes. 'Integrated' only retains 2000 genes by default. I have changed it to 3000 in my case.

I suspected something wrong in the sctransform based integration workflow, so I tried the hvg based integration workflow and the integrate assay looks similar in distribution.

So, during the integration step where 3000 cells are selected, somehow it must be picking cells with lower number of genes detected for sample 1 and 2.

ADD REPLYlink modified 12 months ago • written 12 months ago by rmf940

colSums(GetAssayData(su,assay="integrated")>0) does not give you number of detected genes. It gives you the number of detected genes that were used for the integrated analysis. In your case, batch 3 has higher expression of those genes.

ADD REPLYlink written 12 months ago by igor11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1694 users visited in the last hour