First, sorry if I am missing something basic. I am a programmer recently turned bioinformatician, so there is still a lot I don't know. This is a cross-post of a question on Bioinformatics SE; I hope that is not bad form (it's my first post on both platforms).

While it is obvious that scRNA-seq data contain lots of zeroes, I couldn't find any detailed explanation of why they occur more frequently than would be expected from a negative binomial distribution, apart from brief remarks along the lines of "substantial technical and biological noise". In what follows, let's assume we are looking at a single gene that is expressed at approximately the same level across all cells.

If zeroes were caused solely by low capture efficiency and sequencing depth, all observed zeroes should be explained by low mean expression across cells. This, however, does not seem to be the case, as the distribution of gene counts across cells often has more zeroes than would be expected from a negative binomial model. For example, the ZIFA paper explicitly uses a zero-inflated model for scRNA-seq data. Modelling scRNA-seq counts as zero-inflated negative binomial seems widespread throughout the literature.

However, assuming a negative binomial distribution for the original counts (as measured in bulk RNA-seq), and assuming that every RNA fragment of the same gene from every cell has approximately the same (low) chance of being captured and sequenced, the distribution across single cells should still be negative binomial, only with a lower mean (see this Math SE question for related math).
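To make the argument above concrete, here is a minimal simulation sketch of that claim. It assumes capture is independent binomial sampling of each molecule with the same probability; the function names and parameter values (true mean 50, size 2, 5% capture) are made up for illustration and not taken from any dataset.

```python
import numpy as np


def nb_zero_prob(mean, size):
    """P(X = 0) for a negative binomial with Var = mean + mean**2 / size."""
    return (size / (size + mean)) ** size


def simulate_captured_counts(mean, size, capture_prob, n_cells, seed=0):
    """Draw true per-cell counts from an NB, then keep each molecule
    independently with probability capture_prob (binomial thinning)."""
    rng = np.random.default_rng(seed)
    # NumPy parametrizes the NB by (n, p) with mean n * (1 - p) / p.
    p = size / (size + mean)
    true_counts = rng.negative_binomial(size, p, size=n_cells)
    return rng.binomial(true_counts, capture_prob)


# Illustrative values only: true mean 50, size 2, 5% capture efficiency.
observed = simulate_captured_counts(
    mean=50.0, size=2.0, capture_prob=0.05, n_cells=200_000
)
empirical_zero = float(np.mean(observed == 0))
# If thinning preserves the NB family, the zero fraction should match an NB
# with the same size and the mean scaled by the capture probability.
expected_zero = nb_zero_prob(mean=50.0 * 0.05, size=2.0)
print(f"zero fraction: simulated {empirical_zero:.4f}, plain NB {expected_zero:.4f}")
```

Under these assumptions the two zero fractions should agree up to sampling noise, i.e. low capture efficiency alone gives many zeroes but no excess over the negative binomial.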

So the only remaining possible cause of the inflated zero counts is PCR. Only non-zero counts (after capture) are amplified and then sequenced, which shifts the mean of the observed gene counts away from zero while the pre-PCR zero counts stay zero. Indeed, some quick simulations show that such a procedure can occasionally generate zero-inflated negative binomial distributions. This would suggest that excess zeroes should not be present when UMIs are used. I checked one scRNA-seq dataset with UMIs, and it seems to be fit well by a plain negative binomial.
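The kind of simulation meant here can be sketched roughly as follows. This is a toy model, not a faithful model of PCR: amplification is assumed to multiply the captured molecules by a fixed `gain`, read counts are assumed Poisson around that, and the comparison NB is matched by moments rather than fitted by maximum likelihood. All names and parameter values are hypothetical.

```python
import numpy as np


def moment_matched_nb_zero_prob(counts):
    """Zero probability of the NB with the same mean and variance as counts."""
    m = counts.mean()
    v = counts.var()
    if v <= m:
        # No overdispersion: fall back to the Poisson zero probability.
        return float(np.exp(-m))
    size = m**2 / (v - m)
    return float((size / (size + m)) ** size)


def simulate_reads(mean, size, capture_prob, gain, n_cells, seed=0):
    """Capture (binomial thinning of NB counts) followed by a crude
    amplification step: reads ~ Poisson(gain * captured molecules)."""
    rng = np.random.default_rng(seed)
    p = size / (size + mean)
    true_counts = rng.negative_binomial(size, p, size=n_cells)
    captured = rng.binomial(true_counts, capture_prob)
    # Cells with zero captured molecules get a Poisson rate of 0 and stay zero.
    return rng.poisson(gain * captured)


# Illustrative values only.
reads = simulate_reads(
    mean=50.0, size=2.0, capture_prob=0.05, gain=20.0, n_cells=100_000
)
observed_zero = float(np.mean(reads == 0))
nb_zero = moment_matched_nb_zero_prob(reads)
print(f"zero fraction: simulated reads {observed_zero:.4f}, moment-matched NB {nb_zero:.4f}")
```

In this toy setting the read counts have clearly more zeroes than a negative binomial with the same mean and variance. Collapsing reads by UMI would, under the same assumptions, recover the captured molecule counts, which are plain negative binomial.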

Is my reasoning correct? Thanks for any pointers.

Thanks for the info. However, I believe it does not answer my question. AFAIK phased gene expression should be more or less accounted for by the negative binomial distribution. Or am I wrong on this point? Note also that I am not asking why a large number of zeroes occurs, but why there are more zeroes than a negative binomial distribution can explain (it already allows for a lot of zeroes if the mean is low or the dispersion is high).
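To illustrate that last point, a small sketch of the zero probability of a plain negative binomial, using the common parametrization Var = mean + dispersion * mean^2 (the function name and the example values are mine, chosen only for illustration):

```python
def nb_zero_fraction(mean, dispersion):
    """P(X = 0) for an NB with Var = mean + dispersion * mean**2."""
    return (1.0 + dispersion * mean) ** (-1.0 / dispersion)


# Low mean, moderate dispersion: almost all cells are zero.
low_mean = nb_zero_fraction(mean=0.1, dispersion=0.5)
# Higher mean, low dispersion: zeroes are rare.
tight = nb_zero_fraction(mean=5.0, dispersion=0.1)
# Same higher mean, high dispersion: about half the cells are zero again.
dispersed = nb_zero_fraction(mean=5.0, dispersion=5.0)

for label, value in [("low mean", low_mean), ("tight", tight), ("dispersed", dispersed)]:
    print(f"{label}: {value:.3f}")
```

So a large zero fraction on its own is compatible with a plain negative binomial; the question is only about zeroes beyond what these values predict.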

Bursty transcription won't fit a negative binomial distribution; rather, it'll be either zero-inflated or show something like multi-modal negative binomial variance.