Question

The purpose of scaling to a negative binomial distribution in single cell RNA-seq

1

6.2 years ago

ericvaughn11 ▴ 10

Hi,

I'm still a bit new to computational analysis of single cell data, but I'm doing my best to understand why things are done.

As I understand it, when people cluster their data, they typically do some feature selection first, usually taking the most variable genes across the entire dataset, or iteratively subsetting their data and repeating the selection within each subset. The expression of these variable genes is then fit to a negative binomial distribution to estimate scaled expression values, which are then fed into dimensionality reduction and/or clustering algorithms.

I'm having a difficult time trying to understand what the purpose of fitting to a negative binomial distribution is. Is it that this takes into account relative abundances better? Please tell me if I'm on the right track or way off:

Say gene A is expressed lowly in most cells -- it's 1 or 2 copies in some cells, but relatively highly at 5 copies in a few cells. Gene B is comparatively expressed much higher -- several cells express it at 10-20 copies, while other cells express it relatively highly at 50 copies per cell.

So does fitting to a negative binomial distribution in essence take into account the nature of each gene's expression, to provide a normalized, scaled, and centered value of 2 or 3 for both of these genes, despite the differences in overall expression? And is it fit to a negative binomial distribution because gene expression follows this distribution? I've heard this but don't know which paper showed it.

I'd appreciate any explanations or links that might clarify this more.

Thanks,
Eric

RNA-Seq scRNA-Seq • 5.3k views

updated 3 months ago by Ram 43k • written 6.2 years ago by ericvaughn11 ▴ 10

Ram · Answer 1 · 2018-03-21

4

6.2 years ago

Kevin Blighe 88k

Yes. RNA-seq count data are overdispersed (the variance exceeds the mean) and are well described by a negative binomial distribution, so one has to model the data as such if one wishes to derive statistics from them. Early analysis methods modelled the counts as Poisson, which assumes that the variance equals the mean; this underestimates the variability between biological replicates and was found to produce false-positive associations even after the model had been fitted. Microarray gene expression data, on the other hand, are approximately normally distributed after log transformation.
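As a rough illustration of why a Poisson model can be too optimistic, the sketch below compares the mean and variance of a set of counts and derives a method-of-moments dispersion. The count values and the function name `overdispersion_summary` are hypothetical, chosen only for this example; real pipelines estimate dispersion with more careful, shrinkage-based procedures.

```python
from statistics import mean, variance

def overdispersion_summary(counts):
    """Return mean, sample variance, and a method-of-moments NB dispersion.

    Under a Poisson model the variance equals the mean. Under a negative
    binomial model, variance = mean + alpha * mean**2, so a positive alpha
    signals overdispersion relative to Poisson.
    """
    m = mean(counts)
    v = variance(counts)  # unbiased sample variance
    alpha = max(0.0, (v - m) / m**2) if m > 0 else 0.0
    return m, v, alpha

# Hypothetical counts for one gene across eight replicate samples.
counts = [12, 30, 7, 45, 19, 3, 60, 24]
m, v, alpha = overdispersion_summary(counts)
print(f"mean={m:.1f} variance={v:.1f} alpha={alpha:.2f}")
```

If the variance is far larger than the mean, as in this made-up example, a Poisson model would understate the uncertainty, which is one route to the false positives mentioned above.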

Practically, what does fitting a model actually mean? - What happens is that we create a generalised linear model (GLM) of the data and specify the negative binomial as the family, with each gene's counts as the response. In pseudocode:

glm(gene1_counts ~ condition + ConfoundingFactors, family="NegativeBinomial")

That model is fitted for every gene. Other parameters are added for the purposes of normalisation and dispersion estimation.
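To make the idea of a per-gene negative binomial model concrete, here is a minimal sketch for the simplest possible design: one gene, two groups, a log link, equal size factors, and a dispersion that is simply assumed rather than estimated. Under those assumptions the maximum-likelihood fitted means are the group means, so no iterative fitting is needed. All names (`nb_logpmf`, `nb_loglik`, `fit_two_group`) and the data are hypothetical; tools such as DESeq2 or edgeR additionally estimate the dispersion and include size-factor offsets.

```python
import math

def nb_logpmf(y, mu, alpha):
    """Log-probability of count y under an NB with mean mu and dispersion
    alpha, parameterised so that variance = mu + alpha * mu**2."""
    r = 1.0 / alpha
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def nb_loglik(beta0, beta1, counts, group, alpha):
    """Log-likelihood of the log-link model log(mu_i) = beta0 + beta1 * group_i."""
    return sum(nb_logpmf(y, math.exp(beta0 + beta1 * g), alpha)
               for y, g in zip(counts, group))

def fit_two_group(counts, group):
    """Closed-form MLE for a single 0/1 covariate with fixed dispersion and
    equal size factors: the fitted means equal the group means."""
    mean0 = sum(y for y, g in zip(counts, group) if g == 0) / group.count(0)
    mean1 = sum(y for y, g in zip(counts, group) if g == 1) / group.count(1)
    return math.log(mean0), math.log(mean1 / mean0)

# Hypothetical counts for one gene: four control (0) and four treated (1) samples.
counts = [8, 12, 10, 14, 35, 50, 41, 38]
group = [0, 0, 0, 0, 1, 1, 1, 1]
alpha = 0.1  # assumed, fixed dispersion (not estimated here)

beta0, beta1 = fit_two_group(counts, group)
print(f"baseline mean = {math.exp(beta0):.1f}, "
      f"log2 fold change = {beta1 / math.log(2):.2f}")
```

Here `beta1` plays the role of the coefficient that is tested for each gene; the statistical inference (for example a Wald or likelihood-ratio test) is built on top of a fit like this one.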

With a model of the RNA-seq data, we can then make statistical inferences.

Some of your other questions are specific to the normalisation method that one actually uses. Typically, what are known as size factors are calculated, which will cater for the scenario that you mentioned.
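As a sketch of what a size-factor calculation can look like, the example below uses the median-of-ratios idea. That choice is an assumption for illustration, since the answer does not say which method is meant, and single-cell workflows often use other estimators. The function name `size_factors` and the toy count matrix are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts per sample.
    Genes with a zero in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) == 0:
            continue
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical data: sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [2, 4],      # lowly expressed gene
    [15, 30],    # moderately expressed gene
    [50, 100],   # highly expressed gene
    [0, 3],      # skipped: zero in one sample
]
sf = size_factors(counts)
normalised = [[c / s for c, s in zip(gene, sf)] for gene in counts]
print(sf)
```

Dividing each sample's counts by its size factor puts the samples on a common depth scale, so the same gene can be compared between samples or cells.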

Kevin

updated 3 months ago by Ram 43k • written 6.2 years ago by Kevin Blighe 88k

0

What is the mechanistic explanation of why the negative binomial is physically plausible? For binomial processes (negative or not), we have the notion of (a) a success rate, (b) the number of trials, and (c) the number of successes. In RNA-seq, what are the trials and successes? Just wondering: when the negative binomial gives us a probability distribution of "number of trials given a number of successes," what are those quantities biologically? E.g., is a gene read a "success"?

2.5 years ago by willy ▴ 10

0

I ask this because the math here is easy to get, but why this model makes sense seems totally arbitrary and is lost on me lol

2.5 years ago by willy ▴ 10

1

I would ask this to a statistician

2.5 years ago by Kevin Blighe 88k

0

I think I have a better grasp of why the negative binomial is used for modeling true counts, after a few nights' sleep.

Assume that scRNA-seq reads are a Bernoulli process with probability p. Specifically, each transcript in a cell is like a trial, and each read is a success of that trial. And these successes happen with some probability p. Then, the number of reads k is binomially distributed by k ~ Bin(n, p). We're interested in the inverse, i.e., the distribution of n given k and p. This is precisely the negative binomial distribution: n ~ NegBin(k, p).

I think this is the correct way to motivate the negative binomial. If anyone can correct me, please do!!

My only additional question, then, is how do we know what p is? Is it inferred from gene length and sequencing depth? I'm not sure.
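For reference, the textbook "number of trials until the k-th success" reading of the negative binomial can be simulated directly. This is only a sketch of that definition with hypothetical values of k and p; it does not establish whether transcripts and reads actually map onto trials and successes in the way proposed above.

```python
import random

def trials_until_k_successes(k, p, rng):
    """Run Bernoulli(p) trials until the k-th success; return the trial count."""
    trials = successes = 0
    while successes < k:
        trials += 1
        if rng.random() < p:
            successes += 1
    return trials

rng = random.Random(0)  # fixed seed so the run is repeatable
k, p = 5, 0.2           # hypothetical values
draws = [trials_until_k_successes(k, p, rng) for _ in range(20000)]

empirical_mean = sum(draws) / len(draws)
theoretical_mean = k / p  # mean number of trials until the k-th success
print(f"empirical mean = {empirical_mean:.2f}, "
      f"theoretical mean = {theoretical_mean:.2f}")
```

In this parameterisation p is a property of the assumed Bernoulli process, so the sketch says nothing about how p would be obtained from gene length or sequencing depth.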

2.5 years ago by willy ▴ 10

score 0 · Answer 2 · 2024-01-19

I highly recommend reading this paper, which goes into detail on "why this model makes sense".

Basically, in scRNA-seq counts there are two layers of uncertainty: (1) a technical "measurement" process, which can be modeled with a Poisson distribution, and (2) a biological "expression" process, which can be modeled with a gamma distribution. The negative binomial is the result of combining the Poisson and gamma layers of uncertainty into one observation model.

From the authors:

observed scRNA-seq counts reflect two distinct factors: the variation in actual expression levels among cells, and the imperfect measurement process. Therefore, models for observed scRNA-seq counts, which we will call observation models, are obtained by specifying: (1) an expression model that describes how the true expression levels vary among cells/genes, and (2) a measurement model that describes how observed counts deviate from the true expression levels.
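The two-layer construction described above can be checked with a small simulation: draw a "true" expression level from a gamma distribution, then draw the observed count from a Poisson with that level as its rate. The gamma parameters, the seed, and the helper `poisson_draw` are hypothetical choices for this sketch, which uses only the Python standard library.

```python
import math
import random

def poisson_draw(lam, rng):
    """Poisson sample via Knuth's multiplication method; adequate for the
    modest rates used in this sketch."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(1)          # fixed seed so the run is repeatable
shape, mean_expr = 2.0, 10.0    # hypothetical gamma shape and mean
scale = mean_expr / shape

counts = []
for _ in range(20000):
    true_expression = rng.gammavariate(shape, scale)   # expression model
    counts.append(poisson_draw(true_expression, rng))  # measurement model

emp_mean = sum(counts) / len(counts)
emp_var = sum((c - emp_mean) ** 2 for c in counts) / (len(counts) - 1)
# Negative binomial variance implied by the mixture: mu + mu**2 / shape.
nb_var = mean_expr + mean_expr ** 2 / shape
print(f"mean={emp_mean:.2f} variance={emp_var:.2f} NB variance={nb_var:.2f}")
```

The simulated counts should have a variance well above their mean and close to the negative binomial value, which is the sense in which the Poisson and gamma layers combine into one observation model.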