Hi,
After annotation all my major cell types, I subsetted each subset of interest further to tease out more nuanced annotations. So for example, my T cell subset had 11 subclusters. I then found patient wise DEGs (condition vs control), in each of these subsets. Patient wise since we were interested in looking at the samples itself, but each patient contributed only one sample:
subset_object <- readRDS("subset.RDS")
DefaultAssay(subset_object) <- "RNA"
Idents(subset_object) <- "seurat_clusters"
subset_object$subcluster <- Idents(subset_object)
subset_object <- JoinLayers(subset_object)
markers_list <- list()
# Loop over each subcluster
for (cluster in unique(subset_object$subcluster)) {
message(paste0("Processing cluster: ", cluster))
subcluster_data <- subset(subset_object, subset = subcluster == cluster)
Idents(subcluster_data) <- "category"
nv_cells <- WhichCells(subcluster_data, idents = "NV")
for (patient in setdiff(unique(subcluster_data$category), "NV")) {
message(paste0(" Comparing ", patient, " vs NV"))
patient_cells <- WhichCells(subcluster_data, idents = patient)
if (length(patient_cells) < 5 | length(nv_cells) < 5) {
message(paste0(" Skipping ", patient, " due to low cell count"))
next
}
markers <- FindMarkers(
subcluster_data,
ident.1 = patient_cells,
ident.2 = nv_cells,
logfc.threshold = 0.25,
min.pct = 0.25,
only.pos = FALSE
)
markers$gene <- rownames(markers)
markers$cluster <- cluster
key <- paste0("subset_cluster_", cluster, "_", patient, "_vs_NV")
markers_list[[key]] <- markers
}
}
# Save results to CSV
for (name in names(markers_list)) {
write.csv(markers_list[[name]], file = paste0("csvs/", name, ".csv"))
}
The code works perfectly okay, but my issue is that the results from these DEGs are not unique to that specific subcluster. So for almost all 12 subclusters in the T cell subset, Patient1 will have almost the same top few DEGs. I understand overclustering can add to this problem, but we are quite confident in our annotations, and regardless at least some clusters should have different DEGs then.
I am unsure how to tackle this problem because it seems like these redundant DEGs are not at all related to which subcluster the analysis is being done in. Would appreciate any insight thanks!
Thank you so much for your answer! I am not familiar with an UpSet plot, would you have a vignette that you would recommend for this? I did use SoupX and scDblFinder for ambient RNA and doublet QC at the all cell types level (before any subsetting).
Additionally, I did not pseudobulk while finding DEGs as you can see from my code. Would you recommend pseudobulking? And unfortunately again I do not know much about eQTL for genes, would you have a resource I could look at?
Thanks again!
Unless you have replicate runs for the same sample, you won't be able to pseudobulk (since you'll be comparing k=1 to k=(N-1)). If you do have replicates, I do recommend pseudobulking, as it tends to be more conservative.
The UpSet plot is fairly straightforward: