Question

Fixing batch effect with seurat integrate

0

Entering edit mode

6.0 years ago

firestar ★ 1.7k

I have three 10X runs. If I merge them with CreateSeuratObject(), this is the distribution of genes detected for each run (batch) (nFeatures_RNA).

enter image description here

If I integrate them, using hvg approach or sctransform approach. (Showing sctransform based integration below)

library(sctransform)
su.list <- SplitObject(su,split.by="orig.ident")
for (i in 1:length(su.list)) {
    su.list[[i]] <- SCTransform(su.list[[i]],return.only.var.genes=F,vars.to.regress=c("mito_percent","S.Score","G2M.Score"))
}
su.features <- SelectIntegrationFeatures(su.list,nfeatures=3000)
su.list <- Seurat::PrepSCTIntegration(su.list,anchor.features=su.features)
su.anchors <- FindIntegrationAnchors(object.list=su.list,normalization.method="SCT",anchor.features=su.features)
su <- IntegrateData(anchorset=su.anchors,normalization.method="SCT")

I get this gene distribution for each run. Number of genes have now been recalculated based on the 'integrated' assay containing 3000 genes.

enter image description here

Now, is this a batch effect? How would I correct for this? Should I run sctransform() or NormalizeData() again using 'vars.to.regress' on 'integrated' assay? Similarily, how would I corrected for mito_percent, cell cycle etc in an integration workflow? Should I regress them before or after integration?

Thanks.

scrna.seq seurat single-cell • 7.1k views

ADD COMMENT • link 6.0 years ago by firestar ★ 1.7k

0

Entering edit mode

Something is wrong. nFeatures_RNA does not get modified by either sctransform() or NormalizeData(). Additionally, notice that the number of genes drops for each of your batches. You are doing something else besides integration between the first and second plot. You should check that the size of your Seurat object is the same before and after. You should also post the exact commands you used.

ADD REPLY • link 6.0 years ago by igor 13k

0

Entering edit mode

I also had the feeling something is wrong. Then, I have looked into it and I don't see anything obviously wrong. So, in the first plot, I use nFeatures_RNA, In the second plot, I manually recalculate num of genes detected from 'integrated' assay using colSums(GetAssayData(su,assay="integrated")>0). 'nFeatures_SCT' also does not have this change in distribution since RNA assay and SCT assay both have all the initial set of genes. 'Integrated' only retains 2000 genes by default. I have changed it to 3000 in my case.

I suspected something wrong in the sctransform based integration workflow, so I tried the hvg based integration workflow and the integrate assay looks similar in distribution.

So, during the integration step where 3000 cells are selected, somehow it must be picking cells with lower number of genes detected for sample 1 and 2.

ADD REPLY • link 6.0 years ago by firestar ★ 1.7k

1

Entering edit mode

colSums(GetAssayData(su,assay="integrated")>0) does not give you number of detected genes. It gives you the number of detected genes that were used for the integrated analysis. In your case, batch 3 has higher expression of those genes.

ADD REPLY • link 6.0 years ago by igor 13k