Disclaimer 1: I also posted this in the EBSeq google group, but it does not seem to attract much attention.
Disclaimer 2: this is a long post, but I think it is well worth reading and answering for people (like me) who really want to understand how EBSeq works but did not get there because the mathematics in the paper are difficult, or for people who completely understand EBSeq and want to help people like me.
I have been struggling a lot to make sense of the methods described in the paper, partly because I'm not a statistician, but also because the formulas in the paper come with little context. I have found a couple of questions on this google group, like here and here, that did not really get an answer. What I will try to do here is get started with explaining the method in words. I will regularly bump into the limits of my understanding, and I hope that others will take it from there and answer my questions. Let's agree that "read the paper" or "read the supplementals" don't count as answers (I did read them), so we can end up with some kind of explanation that is accessible to people who don't read math like English.
So, the basic question we want to answer is: does the estimated expression of a certain isoform change as a function of some conditions, and is this change statistically significant in the face of the estimated variance? This is equivalent to many more commonly used tests that we apply every day (like the t-test, ANOVA, etc.), with the differences that here we repeat the test numerous times, and we use a negative binomial (NB) distribution to model the distribution of read counts, rather than the usually more familiar normal.
NB distributions can be parameterized in many ways. For read counts, I have always found it most intuitive from the "over-dispersed Poisson" point of view: if all counts were just technical replicates from the same sample, and our methods were "perfect", we would find a Poisson distribution, where the only source of variance would be 'shot noise'. By definition, the variance would equal the mean. However, we are not looking at mere technical replicates, but rather at independent biological samples, and we expect them to differ from each other. The extra "noise" introduced by these differences will increase the variance of our counts. Also, our methods are not perfect, so even with very little biological variation, we would find our variance to be larger than the mean. Mathematically, this system is described as a Poisson distribution whose lambda parameter is itself gamma distributed.
The classical parametrization with two parameters r and p reads: when p is the probability of success in each experimental trial, and we keep sampling until a pre-defined number r of successes has been counted, then the number of failures x observed along the way is NB distributed: x ~ NB(r, p). (This is also the convention of R's rnbinom, with size = r and prob = p, and it gives the mean µ = r(1-p)/p that the code below relies on.) I have not yet figured out a way of translating this to counting reads from a certain gene among many other reads, and would be very thankful if somebody could show this, or tell me (why) it would not make sense...
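To convince yourself of that classical parametrization (this is a generic NB fact, nothing EBSeq-specific), the textbook moments µ = r(1-p)/p and Var = r(1-p)/p² = µ/p can be verified with a quick simulation using R's own rnbinom:

```r
# Classical NB parametrization: x ~ NB(size = r, prob = p),
# x = number of failures before the r-th success.
set.seed(1)
r <- 100
p <- 0.5
x <- rnbinom(1e6, size = r, prob = p)
c(mean(x), r * (1 - p) / p)     # both close to 100
c(var(x),  r * (1 - p) / p^2)   # both close to 200: variance > mean, i.e. overdispersed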
You can use the following piece of R code to get a feel for the behavior of these distributions. For example: in the gamma distribution, lowering the shape (r) for a fixed µ will increase the variance. For the negative binomial, at a fixed µ, the p parameter is a function of this same shape (r). The higher r, the closer p gets to 1, and the lower the variance of the NB. For very large r (r >> µ), the gamma distribution becomes a very pointy spike around the expected value, and hence the variance is merely Poisson variance. For very small r (r << µ), p gets close to 0, and the gamma distribution becomes wide (and skewed) around the expected value. Here, the Poisson variance becomes almost negligible compared to the total variance.
# Some R code that shows the equivalence of the classical negative binomial
# and the overdispersed Poisson:
set.seed(1)
N <- 100000            # just the number of random trials
# We will be working with a gene that has an expected value of 1000 counts:
mu <- 1000
# Poisson parameter:
lambda <- mu           # expected value
# Gamma parameters:
r <- 100               # shape parameter
gamma_scale <- mu/r    # scale parameter
# Negative binomial parameters:
size <- r
p <- r/(mu + r)        # follows from mu = r(1-p)/p, where mu is the expected value of the NB
q <- 1 - p             # the q parameter used by the authors
# Now, our samples ('gam' rather than 'gamma', to avoid masking base::gamma):
pois       <- rpois(N, lambda)
gam        <- rgamma(N, shape = r, scale = gamma_scale)
pois_gamma <- rpois(N, rgamma(N, shape = r, scale = gamma_scale))  # N gamma draws (not 10000)
negbin     <- rnbinom(N, size = size, prob = p)
mean(pois); mean(gam); mean(pois_gamma); mean(negbin)
var(pois); var(gam); var(pois_gamma); var(negbin)
# Some histograms (pad the last break so the breaks are guaranteed to span the data):
xmin <- floor(min(c(pois, gam, pois_gamma, negbin))) - 10
xmax <- ceiling(max(c(pois, gam, pois_gamma, negbin))) + 10
par(mfrow = c(2, 2))
hist(pois,       xlim = c(xmin, xmax), breaks = seq(xmin, xmax + 10, 10), main = "Poisson")
hist(gam,        xlim = c(xmin, xmax), breaks = seq(xmin, xmax + 10, 10), main = "gamma")
hist(pois_gamma, xlim = c(xmin, xmax), breaks = seq(xmin, xmax + 10, 10), main = "overdispersed Poisson")
hist(negbin,     xlim = c(xmin, xmax), breaks = seq(xmin, xmax + 10, 10), main = "NB")
Now, the authors have chosen a variant of the classical parameterization, where q is the probability of failure rather than success (q = 1 - p). Everything else being equal, this boils down to: the closer q gets to 1, the larger the variance; the closer it gets to 0, the smaller the variance, with in the limit only the Poisson variance left.
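That behavior can be made concrete with a few lines of standard NB algebra (nothing EBSeq-specific here): with the mean µ = r·q/(1-q) held fixed, the variance is µ/(1-q) = µ + µ²/r, so shrinking r pushes q toward 1 and inflates the variance:

```r
# Fix the mean mu and vary the shape r; q = mu/(mu + r) follows from mu = r*q/(1-q)
mu <- 1000
for (r in c(10000, 100, 1)) {
  q <- mu / (mu + r)    # failure probability, as in the paper
  v <- mu + mu^2 / r    # NB variance = mu/(1-q)
  cat(sprintf("r = %5d  q = %.4f  variance = %.0f\n", r, q, v))
}
# As r shrinks, q -> 1 and the variance explodes; as r grows, q -> 0
# and the variance approaches the Poisson limit (variance = mu).
```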
Now that we know about the parameters r and q, and we know what the NB distribution looks like, we can start asking how it fits the data along a gradient of different expression values. In order to do this, we want some kind of prior idea about which values we would expect for the parameters of the NB distribution, given our observations. To this end, EBSeq estimates a distribution of q values based on a sample of the data (I think?). In the hypothetical, oversimplified case where we have libraries of equal size, only one condition and no isoforms, it would fit a beta distribution (which has two parameters: alpha and beta) on the estimates of q. In the case of two conditions, the program starts from initial values of alpha, beta, and the proportion of genes that belong to the DE or NDE class. This prior probability of belonging to "DE" is the quantity that the authors have chosen to call p. A "mixture" model is then used to fit alpha and beta values, either under the "NULL" hypothesis where all parameters are shared, or under the alternative hypothesis where only r is shared, so that the difference between means boils down to a difference between q (because the mean is a function of a constant r and a variable q). The updated values for alpha, beta and p are used as input to fit the whole thing over and over again until convergence.
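For what it's worth, here is how I picture the two-component mixture step for a single gene. This is only my reading of the paper, not EBSeq's actual code: plain moment-based mean estimates stand in for EBSeq's beta-prior machinery, and r is simply fixed. The point is the Bayes' rule combination of the prior proportion p with the NB likelihoods under a shared mean (EE/NDE) versus condition-specific means (DE):

```r
# Hypothetical sketch of the EE/DE mixture idea (NOT EBSeq's implementation):
# r is shared across conditions; only q (and hence the mean) may differ under DE.
set.seed(7)
r  <- 100
c1 <- rnbinom(5, size = r, mu = 1000)   # condition 1 counts for one gene
c2 <- rnbinom(5, size = r, mu = 3000)   # condition 2 counts (truly DE here)
prior_p <- 0.1                          # prior proportion of DE genes

loglik <- function(x, m) sum(dnbinom(x, size = r, mu = m, log = TRUE))
ll_ee <- loglik(c(c1, c2), mean(c(c1, c2)))           # one shared mean/q
ll_de <- loglik(c1, mean(c1)) + loglik(c2, mean(c2))  # condition-specific means

# Posterior probability of DE by Bayes' rule:
post_de <- prior_p * exp(ll_de) /
           (prior_p * exp(ll_de) + (1 - prior_p) * exp(ll_ee))
post_de   # close to 1 for this clearly DE gene
```

In the real EM procedure, as I understand it, these per-gene posteriors would in turn be used to update alpha, beta and p before the next iteration.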
Here, I have my first 3 questions:
1. I would be super happy if someone could explain why the assumption of a constant r parameter makes sense. Perhaps if we understand how the classical NB parametrization translates to RNA-seq logic, this would already start to make sense...
2. How exactly does the mixture model work? Does it fit a single NB on each isoform, with the N * p isoforms with the "most significant" differences fitting two q parameters and the N * (1-p) fitting only one? Or does it fit both models on each isoform, outputting something along the lines of p * likelihood(DE) + (1-p) * likelihood(NDE)?
3. How exactly does the scaling factor fit in this story?
Now, for generalizations involving multiple conditions and isoforms: if more than one isoform is possible per gene, the program still fits a single alpha parameter, but the beta parameter will be specific to each isoform-number group (all genes with a single isoform, two isoforms, three, etc.). My maths are too rusty to grasp to what extent this step complicates the computation, but it turns out to work using the expectation-maximization (EM) algorithm. By the way: is EM also used for the single-isoform-per-gene case? If more than two conditions are present, there will still be a single alpha and one beta per isoform class, but now there will be a prior proportion for each selected pattern of conditions.
My 2 final questions for now:
4. All this prior distribution work should lead to better estimates of the variance or overdispersion, right? How exactly are these better estimates made?
5. What does the final calculation of posterior probabilities for belonging to a certain "constellation" of conditions look like? I guess: for each given isoform, each hypothesis gets a posterior that is calculated based on the prior probability of belonging to this class (are these the values that we see converging in the output$P?), a likelihood score based on the NB fit (which r and q are used here?) and finally the denominator that describes the likelihood under all competing models? Right?
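My guess in question 5, written out as generic Bayes arithmetic (a hypothetical illustration with made-up numbers, not EBSeq code): with K candidate patterns, converged prior proportions, and one NB-based likelihood per pattern, the posterior for each pattern would just be the normalized product of prior and likelihood:

```r
# Hypothetical posterior calculation over K = 3 expression patterns for one isoform.
# Work in log space, since NB likelihoods over many samples are tiny.
log_lik   <- c(-210.4, -198.7, -205.2)   # made-up log-likelihoods, one per pattern
log_prior <- log(c(0.6, 0.3, 0.1))       # made-up converged prior proportions
log_post  <- log_lik + log_prior
post <- exp(log_post - max(log_post))    # subtract the max for numerical stability
post <- post / sum(post)
round(post, 4)                           # posteriors over the K patterns sum to 1
```

The denominator in my question 5 is the `sum(post)` normalization here: the total likelihood under all competing patterns.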
Let's hope this thread leads to answers, and enlightens me and the colleagues who are keen to work with EBSeq and want to understand as well as possible how it works.
Best wishes and many thanks