Question

Redundant Genes in scRNA-seq Subclusters

0

Entering edit mode

3 hours ago

s • 0

Hi,

After annotation all my major cell types, I subsetted each subset of interest further to tease out more nuanced annotations. So for example, my T cell subset had 11 subclusters. I then found patient wise DEGs (condition vs control), in each of these subsets. Patient wise since we were interested in looking at the samples itself, but each patient contributed only one sample:

subset_object <- readRDS("subset.RDS")

DefaultAssay(subset_object) <- "RNA"
Idents(subset_object) <- "seurat_clusters"


subset_object$subcluster <- Idents(subset_object)
subset_object <- JoinLayers(subset_object)
markers_list <- list()

# Loop over each subcluster
for (cluster in unique(subset_object$subcluster)) {
  message(paste0("Processing cluster: ", cluster))


  subcluster_data <- subset(subset_object, subset = subcluster == cluster)
  Idents(subcluster_data) <- "category"

  nv_cells <- WhichCells(subcluster_data, idents = "NV")

  for (patient in setdiff(unique(subcluster_data$category), "NV")) {
    message(paste0("  Comparing ", patient, " vs NV"))


    patient_cells <- WhichCells(subcluster_data, idents = patient)

    if (length(patient_cells) < 5 | length(nv_cells) < 5) {
      message(paste0("    Skipping ", patient, " due to low cell count"))
      next
    }

    markers <- FindMarkers(
      subcluster_data,
      ident.1 = patient_cells,
      ident.2 = nv_cells,
      logfc.threshold = 0.25,
      min.pct = 0.25,
      only.pos = FALSE
    )

    markers$gene <- rownames(markers)   
    markers$cluster <- cluster          

    key <- paste0("subset_cluster_", cluster, "_", patient, "_vs_NV")
    markers_list[[key]] <- markers
  }
}

# Save results to CSV
for (name in names(markers_list)) {
  write.csv(markers_list[[name]], file = paste0("csvs/", name, ".csv"))
}

The code works perfectly okay, but my issue is that the results from these DEGs are not unique to that specific subcluster. So for almost all 12 subclusters in the T cell subset, Patient1 will have almost the same top few DEGs. I understand overclustering can add to this problem, but we are quite confident in our annotations, and regardless at least some clusters should have different DEGs then.

I am unsure how to tackle this problem because it seems like these redundant DEGs are not at all related to which subcluster the analysis is being done in. Would appreciate any insight thanks!

immunology single cell • 51 views

ADD COMMENT • link updated 25 minutes ago by LChart 5.1k • written 3 hours ago by s • 0

score 0 · Answer 1 · 2025-10-22

What you're describing sounds less like a technical problem, and more like a mismatch between the data and your expectations. Based on at least some clusters should have different DEGs then, it seems that you are expecting each sub-cluster to have its own set of patient-specific DE genes. However, if a patient shows over-expression of (say) IL2RA in T cells generally, I would expect that to be "inherited" by all T-cell subtypes.

I would take the DE genes and generate an UpSet plot. Perhaps you have already done this and found that all of the genes are shared between 10+ subtypes - or maybe there are a few that are subtype specific.

Depending on the nature of the shared genes, this could also be a simple RNA quality, or sample handling artifact. The top genes may make sense (highly expressed, with high-effect eQTLs : very believable) or they may not (mitochondrial genes, MALAT1, snoRNA : suspicious). The more they are shared with pseudobulk DE (all cells), the more skeptical I would be about their veracity (though, obviously, an eQTL for a ubiquitously-expressed gene could result in this observation).