Question

Redundant Genes in scRNA-seq Subclusters

0

Entering edit mode

4 days ago

s • 0

Hi,

After annotation all my major cell types, I subsetted each subset of interest further to tease out more nuanced annotations. So for example, my T cell subset had 11 subclusters. I then found patient wise DEGs (condition vs control), in each of these subsets. Patient wise since we were interested in looking at the samples itself, but each patient contributed only one sample:

subset_object <- readRDS("subset.RDS")

DefaultAssay(subset_object) <- "RNA"
Idents(subset_object) <- "seurat_clusters"


subset_object$subcluster <- Idents(subset_object)
subset_object <- JoinLayers(subset_object)
markers_list <- list()

# Loop over each subcluster
for (cluster in unique(subset_object$subcluster)) {
  message(paste0("Processing cluster: ", cluster))


  subcluster_data <- subset(subset_object, subset = subcluster == cluster)
  Idents(subcluster_data) <- "category"

  nv_cells <- WhichCells(subcluster_data, idents = "NV")

  for (patient in setdiff(unique(subcluster_data$category), "NV")) {
    message(paste0("  Comparing ", patient, " vs NV"))


    patient_cells <- WhichCells(subcluster_data, idents = patient)

    if (length(patient_cells) < 5 | length(nv_cells) < 5) {
      message(paste0("    Skipping ", patient, " due to low cell count"))
      next
    }

    markers <- FindMarkers(
      subcluster_data,
      ident.1 = patient_cells,
      ident.2 = nv_cells,
      logfc.threshold = 0.25,
      min.pct = 0.25,
      only.pos = FALSE
    )

    markers$gene <- rownames(markers)   
    markers$cluster <- cluster          

    key <- paste0("subset_cluster_", cluster, "_", patient, "_vs_NV")
    markers_list[[key]] <- markers
  }
}

# Save results to CSV
for (name in names(markers_list)) {
  write.csv(markers_list[[name]], file = paste0("csvs/", name, ".csv"))
}

The code works perfectly okay, but my issue is that the results from these DEGs are not unique to that specific subcluster. So for almost all 12 subclusters in the T cell subset, Patient1 will have almost the same top few DEGs. I understand overclustering can add to this problem, but we are quite confident in our annotations, and regardless at least some clusters should have different DEGs then.

I am unsure how to tackle this problem because it seems like these redundant DEGs are not at all related to which subcluster the analysis is being done in. Would appreciate any insight thanks!

immunology single cell • 386 views

ADD COMMENT • link updated 3 days ago by LChart 5.1k • written 4 days ago by s • 0

score 1 · Answer 1 · 2025-10-22

1

Entering edit mode

4 days ago

LChart 5.1k

What you're describing sounds less like a technical problem, and more like a mismatch between the data and your expectations. Based on at least some clusters should have different DEGs then, it seems that you are expecting each sub-cluster to have its own set of patient-specific DE genes. However, if a patient shows over-expression of (say) IL2RA in T cells generally, I would expect that to be "inherited" by all T-cell subtypes.

I would take the DE genes and generate an UpSet plot. Perhaps you have already done this and found that all of the genes are shared between 10+ subtypes - or maybe there are a few that are subtype specific.

Depending on the nature of the shared genes, this could also be a simple RNA quality, or sample handling artifact. The top genes may make sense (highly expressed, with high-effect eQTLs : very believable) or they may not (mitochondrial genes, MALAT1, snoRNA : suspicious). The more they are shared with pseudobulk DE (all cells), the more skeptical I would be about their veracity (though, obviously, an eQTL for a ubiquitously-expressed gene could result in this observation).

ADD COMMENT • link 4 days ago by LChart 5.1k

0

Entering edit mode

Thank you so much for your answer! I am not familiar with an UpSet plot, would you have a vignette that you would recommend for this? I did use SoupX and scDblFinder for ambient RNA and doublet QC at the all cell types level (before any subsetting).

Additionally, I did not pseudobulk while finding DEGs as you can see from my code. Would you recommend pseudobulking? And unfortunately again I do not know much about eQTL for genes, would you have a resource I could look at?

Thanks again!

ADD REPLY • link 3 days ago by s • 0

0

Entering edit mode

Unless you have replicate runs for the same sample, you won't be able to pseudobulk (since you'll be comparing k=1 to k=(N-1)). If you do have replicates, I do recommend pseudobulking, as it tends to be more conservative.

The UpSet plot is fairly straightforward:

library(UpSetR)
upset(
  fromList(
    list(
      set1=sample.int(100, 50),
      set2=sample.int(100, 20),
      set3=sample.int(100, 40)
    )
  )
)

enter image description here

ADD REPLY • link 3 days ago by LChart 5.1k