I have single-nucleus 10x Genomics datasets from different regions of the human brain and spinal cord, and I have a few questions.
First, I could not find any recent paper that gives an idea of the proportions of different cell types in these regions (oculomotor nucleus, medulla, anterior horn of the lumbar spinal cord, and cervical cross-section). I would appreciate any reference describing cell-type composition and proportions, especially a single-cell reference.
Second, when it comes to clustering, I have different clusters with several gene markers, and I am confused about how I should assign those clusters. I increased the clustering resolution to split the clusters properly, but it made no difference. Then I included more PCs, in case I had used too few principal components to separate the cell types of interest; that did not work either. I am also wondering how flexible I should be in cluster assignment: if some of the cell-type markers are not expressed in a cluster, can I assign the cluster based on other markers? Or is it acceptable to have a mix of markers, as long as most of the specific cell-type markers are more highly expressed than the unspecific ones?
Third, given that I have a single-nucleus dataset, should I apply a maximum cutoff in the QC step? Some nuclei have 7000 genes or more. At first I tried not to be strict and set 7000 as the maximum number of genes, but I am not sure that this many genes are expected in nuclei.
Any help would be greatly appreciated.
That's quite the list of questions. Please consider breaking future posts up into multiple posts if they contain more than 2-3 questions. I had to shorten my answer to cover everything in a single post.
Datasets & Cell Proportions
There are a fair number of brain datasets out there - the Allen Brain Map is huge, the UCSC Cell Browser contains several, and the Broad Single Cell Portal contains many others, not to mention GEO and other data repositories. Despite those sources, you may not find a comparable dataset for your favorite tissue. And even if you do, single cell technologies are imprecise tools for comparing cell type composition between samples, because sampling bias and technical variation both distort the recovered proportions. Of course, previous studies would at least give you an idea of what to expect in terms of cell proportions, so your idea to look for them isn't a bad one. If you can't find a comparable single cell dataset, you might consider looking for flow cytometry studies of your tissue of interest. Just because they don't use the latest cutting-edge technology doesn't mean their results can't help you interpret your data.
Clustering & Cell Annotation
Okay, so this is where single cell analysis starts to rear its ugly head. First off, conventional cell markers as used for flow/IHC don't always translate well to single cell analysis, simply because RNA abundance does not equal protein abundance. In addition, certain genes are seemingly more susceptible to dropout than others. CD4 is a notable example of a great surface marker that is more or less useless as a marker in single cell RNA-seq. This highlights the risk of using only one (or two or three) markers to assign a cell type, as it's quite common for markers to bleed over between clusters, as you've observed yourself. This isn't to say that you can't annotate your clusters/cells manually, just that it isn't always as simple as most published papers make it seem with their clean, clearly defined clusters. Between you and me, plenty of published works have annotations that are mildly (or sometimes wildly) inaccurate, particularly for less well-defined cellular subtypes. It takes domain expertise and quite a bit of swimming in the data to properly annotate cells with only a few markers, and even then it can be subjective.
A more robust, and generally easier, method is to use previously published datasets as references. This can be as simple as scoring each cell for cell type-specific gene signatures derived from previous bulk or scRNA-seq studies, preferably from more than one of them. These signatures generally contain many more markers than a manual panel, so the dominant cell type within each cluster becomes more obvious even when individual markers drop out. There are also correlation-based approaches that use reference datasets containing purified bulk or annotated single cell RNA-seq data to automatically assign a cell type to each cell based on its similarity to the cell types in the reference. SingleR is one method that takes this approach.
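To make the signature-scoring idea concrete, here is a minimal sketch in plain Python. It is not how SingleR or any particular package works; the function names (`signature_score`, `label_cell`, `label_clusters`), the three short signatures, and the toy expression values are all made up for illustration. Each cell is scored per signature (mean of the signature genes minus the cell's overall mean), and each cluster takes the label that wins the most cells.

```python
from collections import Counter
from statistics import mean

# Illustrative signatures only; a real analysis would use much longer gene
# lists derived from published reference datasets.
signatures = {
    "neuron": ["SNAP25", "RBFOX3", "SYT1"],
    "astrocyte": ["GFAP", "AQP4", "SLC1A2"],
    "oligodendrocyte": ["MBP", "PLP1", "MOG"],
}
ALL_GENES = [g for genes in signatures.values() for g in genes]


def make_cell(**expressed):
    """Toy log-normalized profile: listed genes get a value, the rest are 0."""
    return {g: expressed.get(g, 0.0) for g in ALL_GENES}


def signature_score(cell, genes):
    """Mean expression of the signature genes minus the cell's overall mean."""
    found = [cell[g] for g in genes if g in cell]
    if not found:
        return float("-inf")  # none of the signature genes were measured
    return mean(found) - mean(cell.values())


def label_cell(cell, signatures):
    """Return the name of the best-scoring signature for one cell."""
    scores = {name: signature_score(cell, genes) for name, genes in signatures.items()}
    return max(scores, key=scores.get)


def label_clusters(cells, clusters, signatures):
    """Assign each cluster the label that wins the most of its cells."""
    votes = {}
    for cell, cluster in zip(cells, clusters):
        votes.setdefault(cluster, Counter())[label_cell(cell, signatures)] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


# Hypothetical nuclei; two of them have lost a canonical marker to dropout.
cells = [
    make_cell(SNAP25=3.1, RBFOX3=2.4, SYT1=2.8, MBP=0.2),
    make_cell(SNAP25=2.5, SYT1=2.9, SLC1A2=0.3),  # RBFOX3 dropped out
    make_cell(GFAP=2.7, AQP4=2.2, SLC1A2=3.0),
    make_cell(AQP4=1.9, SLC1A2=2.6, SNAP25=0.4),  # GFAP dropped out
]
clusters = [0, 0, 1, 1]

annotations = label_clusters(cells, clusters, signatures)
print(annotations)  # {0: 'neuron', 1: 'astrocyte'}
```

Note that the second and fourth toy nuclei are each missing a "classic" marker yet still get the expected label, which is the practical argument for scoring many genes at once rather than requiring every marker to be present.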
QC Filtering
There are no hard and fast guidelines for QC filtering, as what's appropriate can vary drastically between cell types and runs. Again, there are multiple methods for this. You've tried the simplest one: fixed, arbitrarily chosen thresholds. This approach can work fine, but it's also risky. At minimum, it has to be adjusted on an experiment by experiment (or even sample by sample) basis, which is inconvenient, time-consuming, and subjective. At worst, it can flat out remove biologically interesting populations that tend to express more genes or have a few more mitochondrial reads.
A more nuanced approach is to use adaptive thresholds that identify outliers for each QC metric, with the assumption that most cells are healthy. Even this approach can be biased against certain subpopulations that biologically tend to have fewer or greater numbers of genes expressed, a higher mitochondrial read %, etc. Figure 3 from the pipeComp paper has excellent examples of this in real data. In theory, the safest approach would be to run a quick clustering, apply adaptive thresholds within each cluster, and then remove any clusters of blatantly dead/dying cells. But depending on the data, such an approach may well be unnecessary.
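As a rough illustration of adaptive thresholds, the sketch below flags values that lie more than a chosen number of median absolute deviations (MADs) from the median, either across all nuclei or within each cluster. This is only one common way to define an outlier, not the implementation of any specific package; the function names, the `nmads=3.0` default, and the gene counts are assumptions chosen for the example. The factor 1.4826 rescales the MAD so that it is comparable to a standard deviation for normally distributed data.

```python
from statistics import median


def mad_bounds(values, nmads=3.0):
    """Lower and upper bounds at median +/- nmads scaled MADs."""
    med = median(values)
    mad = 1.4826 * median([abs(v - med) for v in values])
    return med - nmads * mad, med + nmads * mad


def flag_outliers(values, groups=None, nmads=3.0):
    """Return one boolean per value; True marks an outlier.

    With groups=None the bounds are computed over all values together,
    otherwise separately within each group (e.g. each cluster).
    """
    if groups is None:
        groups = [0] * len(values)
    bounds = {
        g: mad_bounds([v for v, vg in zip(values, groups) if vg == g], nmads)
        for g in set(groups)
    }
    return [not (bounds[g][0] <= v <= bounds[g][1]) for v, g in zip(values, groups)]


# Hypothetical genes-detected counts for two clusters of nuclei: one with
# high complexity (e.g. large neurons) and one with low complexity.
n_genes = [6800, 7100, 6900, 7200, 7000, 1500,
           1800, 2000, 1900, 2100, 1950, 6000]
cluster = ["A"] * 6 + ["B"] * 6

global_flags = flag_outliers(n_genes)
per_cluster_flags = flag_outliers(n_genes, groups=cluster)

print([n for n, f in zip(n_genes, global_flags) if f])       # []
print([n for n, f in zip(n_genes, per_cluster_flags) if f])  # [1500, 6000]
```

On this toy data the pooled thresholds flag nothing, because the two clusters together make the distribution very wide. The per-cluster thresholds flag the 1500-gene nucleus in the high-complexity cluster and the 6000-gene nucleus in the low-complexity cluster as candidates for a closer look, whereas a fixed cap of 7000 genes would have dropped two unremarkable nuclei (7100 and 7200) from cluster A. In practice these metrics are often log-transformed before computing the MAD.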
In addition, if you are not performing any type of doublet removal, I highly recommend doing so.
Lastly, you appear to be a newcomer to single-cell RNA-seq, and I can't recommend the Orchestrating Single Cell Analysis book enough. It is a veritable gold mine of information and code examples from true experts. Though it uses packages from Bioconductor for analysis, the information is applicable to other packages (Seurat, scanpy, etc) just as well.