Question

GO term analysis in scRNA-seq data with a low number of expressed genes

21 months ago

rohitsatyam102 ▴ 850

Hi Everyone!!

Background: I have a unique single-cell dataset, produced in house, in which only a small number of genes are expressed (the cells are at an initial stage of the developmental cycle). We have control and drug-treated scRNA-seq samples and, after differential expression analysis, I want to perform GO term analysis. However, I am unsure how to build an appropriate background (universe gene set) and which methodology is appropriate for the GO term analysis.

Method 1 for Background creation

Find the genes expressed in at least one cell in control and, separately, in treatment, then take the union of the two sets of gene IDs.

Method 2 for Background creation

Find the genes expressed in at least 10% of cells in control or at least 10% of cells in treatment. For this I am trying to calculate the proportion of cells expressing each gene in each sample, i.e. in control and treatment separately, using this function:


per.gene.per.sample.pct <- function(sobj, sample_col) {
  ## Raw counts (genes x cells); the @counts slot assumes a Seurat v3/v4 Assay
  counts <- sobj[["RNA"]]@counts
  ## Sample label of each cell, taken from the metadata column named in sample_col
  samples <- as.character(sobj@meta.data[[sample_col]])
  ## For each sample, the fraction (0-1) of its cells in which each gene
  ## has a non-zero count; Matrix::rowSums handles the sparse count matrix
  tt <- sapply(unique(samples), function(s) {
    Matrix::rowSums(counts[, samples == s, drop = FALSE] > 0) / sum(samples == s)
  })
  rownames(tt) <- rownames(counts)
  ## Genes in rows, samples in columns
  return(tt)
}
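As a rough sketch of how the output of this function could be turned into a Method 2 background, the snippet below applies the 10% cutoff to a small made-up matrix of per-sample fractions. The matrix `pct`, the gene names, and the `min_frac` variable are illustrative assumptions, not values from my dataset.

```r
# Toy stand-in for the matrix returned by per.gene.per.sample.pct():
# rows are genes, columns are samples, values are fractions of cells (0-1).
pct <- matrix(
  c(0.50, 0.02,   # geneA: passes the cutoff in ctrl only
    0.01, 0.03,   # geneB: below the cutoff in both samples
    0.08, 0.12,   # geneC: passes the cutoff in treatment only
    0.30, 0.40),  # geneD: passes the cutoff in both samples
  ncol = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"), c("ctrl", "treatment"))
)

min_frac <- 0.10  # 10% of cells, expressed as a fraction

# Keep a gene if it passes the cutoff in at least one sample
background <- rownames(pct)[apply(pct >= min_frac, 1, any)]
print(background)
```

Because the cutoff is applied per sample and the results are combined with "any", a gene that is only detected after treatment still enters the background.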

Also, for the GO term analysis, should I combine the differentially expressed genes from all clusters and run a single analysis, or should I run a separate analysis for each cluster? The aim is to find out which pathways were perturbed by the treatment, not to characterize the cell types in the single-cell data.

PS: When I try Method 2 for background creation, it leaves me with very few genes compared with the number of DE genes I found using Seurat! I understand why: the DE genes are calculated per cluster, and a gene need not be expressed in every cluster. But does that mean I should build a separate background gene set from the genes expressed in each cluster?

On the other hand, Method 1 gives me an exorbitant number of genes, and I think that's not ideal either.

scrnaseq Seurat • 562 views

updated 21 months ago by mark.ziemann ★ 1.9k • written 21 months ago by rohitsatyam102 ▴ 850

score 2 · Accepted Answer · 2022-07-11

I think Method 1 should work just fine, but you could add a cell-number threshold, e.g. a gene must be expressed in at least 10 cells in a cell state. This way the number of detected genes should increase as more cells are added to the dataset. Let us know how many genes are detected with this approach and how many genes are in the background and foreground sets.
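A minimal sketch of that thresholding idea, using a small dense toy matrix in base R. The `counts` matrix, the `cluster` labels, and the `min_cells` variable are made up for illustration; with a real Seurat object the count matrix would be sparse and the cluster labels would come from the metadata.

```r
# Toy count matrix: 4 genes x 30 cells, two hypothetical clusters of 15 cells
counts <- matrix(0, nrow = 4, ncol = 30,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:30)))
cluster <- rep(c("c0", "c1"), each = 15)

counts["gene1", 1:12]  <- 3            # detected in 12 cells of c0
counts["gene2", 16:27] <- 1            # detected in 12 cells of c1
counts["gene3", c(1:5, 16:20)] <- 2    # detected in 5 cells of each cluster
counts["gene4", 1:2]   <- 1            # detected in 2 cells of c0 only

min_cells <- 10  # example threshold; tune it to the dataset

# Number of cells with a non-zero count, per gene (rows) and per cluster (columns)
n_detected <- sapply(split(seq_len(ncol(counts)), cluster), function(idx) {
  rowSums(counts[, idx, drop = FALSE] > 0)
})

# One background per cluster: genes detected in at least min_cells cells there
background <- lapply(colnames(n_detected), function(cl) {
  rownames(n_detected)[n_detected[, cl] >= min_cells]
})
names(background) <- colnames(n_detected)

print(n_detected)
print(lengths(background))  # background size per cluster
```

Note that the toy `gene3` is detected in 10 cells overall but in only 5 per cluster, so whether it enters the background depends on whether the threshold is applied per cell state or across the whole dataset.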

> Besides, for GO Term analysis should I combine all Differential genes from all the clusters and then carry out GO term analysis or should I separately carry out GO Term analysis for each cluster. The aim is to find out eventually which pathways got perturbed due to treatment and not the cell types in single cell data.

I think treating the different cell states as separate "experiments" is typical. We have to be open to the idea that the different cell states will respond differently to the disease/stimulus.
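To illustrate what "separate experiments" could look like, here is a sketch of a per-cluster over-representation test using the hypergeometric distribution in base R, with each cluster tested against its own background. The helper `ora_pvalue()`, the gene IDs, and the gene set are all hypothetical; in practice a package such as clusterProfiler, whose `enrichGO()` accepts a `universe` argument, would do this across all GO terms and apply multiple-testing correction.

```r
# Hypothetical helper: one-sided over-representation p-value for one gene set
ora_pvalue <- function(de_genes, gene_set, background) {
  # Restrict both lists to the background so all counts share one universe
  de_genes <- intersect(de_genes, background)
  gene_set <- intersect(gene_set, background)
  k <- length(intersect(de_genes, gene_set))  # DE genes that are in the set
  # P(X >= k) when drawing length(de_genes) genes from the background
  phyper(k - 1,
         length(gene_set),
         length(background) - length(gene_set),
         length(de_genes),
         lower.tail = FALSE)
}

# Made-up membership of a single GO term
gene_set <- paste0("g", 6:25)

# Each cluster carries its own background and its own DE gene list
clusters <- list(
  c0 = list(background = paste0("g", 1:100),  de = paste0("g", 1:10)),
  c1 = list(background = paste0("g", 21:120), de = paste0("g", 21:30))
)

pvals <- sapply(clusters, function(cl) {
  ora_pvalue(cl$de, gene_set, cl$background)
})
print(pvals)
```

The same gene set gives a different result in each cluster because both the DE list and the universe change, which is why the choice of background matters so much here.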