I'm still a bit new to computational analysis of single cell data, but I'm doing my best to understand why things are done.
As I understand it, when people cluster their data, typically they do some feature selection, usually taking the most variable genes across the entire dataset or through iteratively subsetting their data and doing this with each subset. The expression of these variable genes is then fit to a negative binomial distribution to estimate scaled expression values, which will then be fed into dimensional reduction and/or clustering algorithms. I'm having a difficult time trying to understand what the purpose of fitting to a negative binomial distribution is. Is it that this takes into account relative abundances better? Please tell me if I'm on the right track or way off: Say gene A is expressed lowly in most cells -- it's 1 or 2 copies in some cells, but relatively highly at 5 copies in a few cells. Gene B is comparatively expressed much higher -- several cells express it at 10-20 copies, while other cells express it relatively highly at 50 copies per cell.
So this fitting to a negative binomial distribution in essence helps take into account the nature of the expression of Gene X to provide a normalized, scaled, and centered value of 2 or 3 for both of these genes, despite the differences in overall expression? And its fit to a negative binomial distribution because gene expression follows this distribution? I've heard this but don't know what paper showed this.
I'd appreciate any explanations or links that might clarify this more.