Question: Fixing batch effect with seurat integrate
gravatar for rmf
4 weeks ago by
rmf700 wrote:

I have three 10X runs. If I merge them with CreateSeuratObject(), this is the distribution of genes detected for each run (batch) (nFeatures_RNA).

enter image description here

If I integrate them, using hvg approach or sctransform approach. (Showing sctransform based integration below)

su.list <- SplitObject(su,"orig.ident")
for (i in 1:length(su.list)) {
    su.list[[i]] <- SCTransform(su.list[[i]],return.only.var.genes=F,"mito_percent","S.Score","G2M.Score"))
su.features <- SelectIntegrationFeatures(su.list,nfeatures=3000)
su.list <- Seurat::PrepSCTIntegration(su.list,anchor.features=su.features)
su.anchors <- FindIntegrationAnchors(object.list=su.list,normalization.method="SCT",anchor.features=su.features)
su <- IntegrateData(anchorset=su.anchors,normalization.method="SCT")

I get this gene distribution for each run. Number of genes have now been recalculated based on the 'integrated' assay containing 3000 genes.

enter image description here

Now, is this a batch effect? How would I correct for this? Should I run sctransform() or NormalizeData() again using '' on 'integrated' assay? Similarily, how would I corrected for mito_percent, cell cycle etc in an integration workflow? Should I regress them before or after integration?


scrna.seq seurat single-cell • 189 views
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by rmf700

Something is wrong. nFeatures_RNA does not get modified by either sctransform() or NormalizeData(). Additionally, notice that the number of genes drops for each of your batches. You are doing something else besides integration between the first and second plot. You should check that the size of your Seurat object is the same before and after. You should also post the exact commands you used.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by igor8.1k

I also had the feeling something is wrong. Then, I have looked into it and I don't see anything obviously wrong. So, in the first plot, I use nFeatures_RNA, In the second plot, I manually recalculate num of genes detected from 'integrated' assay using colSums(GetAssayData(su,assay="integrated")>0). 'nFeatures_SCT' also does not have this change in distribution since RNA assay and SCT assay both have all the initial set of genes. 'Integrated' only retains 2000 genes by default. I have changed it to 3000 in my case.

I suspected something wrong in the sctransform based integration workflow, so I tried the hvg based integration workflow and the integrate assay looks similar in distribution.

So, during the integration step where 3000 cells are selected, somehow it must be picking cells with lower number of genes detected for sample 1 and 2.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by rmf700

colSums(GetAssayData(su,assay="integrated")>0) does not give you number of detected genes. It gives you the number of detected genes that were used for the integrated analysis. In your case, batch 3 has higher expression of those genes.

ADD REPLYlink written 4 weeks ago by igor8.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1797 users visited in the last hour