Is it appropriate to assume genes of similar expression level also have similar variance?
4
3
Entering edit mode
6.7 years ago
-_- ★ 1.1k

In high-throughput assays, this limitation can be overcome by pooling information across genes, specifically, by exploiting assumptions about the similarity of the variances of different genes measured in the same experiment [1].

This is a quote from the DESeq2 paper, .

Update (2018-01-15), adding another quote from the DESeq paper,

This assumption is needed because the number of replicates is typically too low to get a precise estimate of the variance for gene i from just the data available for this gene. This assumption allows us to pool the data from genes with similar expression strength for the purpose of variance estimation.

But I haven't really found any justification for this assumption.

Tools for detecting differential expression, e.g. DESeq2, edgeR relies on the assumption that genes with similar expression also have similar variance, that's how they could estimate variance based on very small sample size (e.g. 3) for two groups (control vs disease). Otherwise, it would seem navie to estimate variance based on just three measures.

However, I wonder why this is a valid assumption? Previously, it might be difficult to get a answer, but now we have TCGA, so I plotted the mean vs standard deviation (just square root of variance) over all 173 leukemia (LAML) samples from TCGA for all 20,392 gene expression levels downloaded from http://firebrowse.org/.

LAML mean expression vs standard deviation

I also plotted mean vs variance, just standard deviation squared, curious about what it would look like.

LAML mean expression vs variance

The dashed line is diagonal, plotted here to just get a sense of the slope of the scatter

I personally find the assumption hard to agree. Any ideas?

Update (2018-01-15): as suggested by @i.sudbery, I added a similar plot in raw read count instead of TPM

LAML mean expression vs standard deviation, in read count

deseq differential-expression • 6.2k views
ADD COMMENT
0
Entering edit mode

IMO, the last plot kind of supports the relationship between the mean and standard deviation. See how the lower bound of standard deviation increases with the mean. Of course, for some genes, the standard deviation is much higher than the average trend. This is expected from biological data and it is also taken care of in DESeq2 where the standard deviation of the outliers are not shrinked.

PS : I think that the dashed lines are misleading and should be removed from your plots.

ADD REPLY
0
Entering edit mode

Why would you only focus on the lower bound, then? If the assumption is true, the above data should be essentially a line because, at a given mean expression level, there is only one value for the variance, isn't it?

Yes, DESeq2 handles them separately

If, on the other hand, an individual gene’s dispersion is far above the distribution of the gene-wise dispersion estimates of other genes, then the shrinkage would lead to a greatly reduced final estimate of dispersion. We reasoned that in many cases, the reason for extraordinarily high dispersion of a gene is that it does not obey our modeling assumptions; some genes may show much higher variability than others for biological or technical reasons, even though they have the same average expression levels. In these cases, inference based on the shrunken dispersion estimates could lead to undesirable false positive calls. DESeq2 handles these cases by using the gene-wise estimate instead of the shrunken estimate when the former is more than 2 residual standard deviations above the curve.

But I find it even hard to argue that the majority of the genes follow this assumption qualitatively. I hope to find more support, but I felt this assumption was made mostly in order to leverage the existing of a large number of genes, and proceed with the analysis. Otherwise, with three replicas and a single gene, there isn't really a robust way to estimate the variance.

ADD REPLY
1
Entering edit mode

If the assumption is true, the above data should be essentially a line because, at a given mean expression level, there is only one value for the variance, isn't it

As i.subery stated in its last comment below, this is more about folowing a trend than an exact relationship. Also, note that the relationship was never described as linear (for edgeR for instance the relationship is assumed as: σ2 = μ + αμ2).

I felt this assumption was made mostly in order to leverage the existing of a large number of genes, and proceed with the analysis.

You are right. This assumption is not perfect, but it has been proven helpful in many cases.

ADD REPLY
0
Entering edit mode

Thanks.

If the assumption is true, the above data should be essentially a line because, at a given mean expression level, there is only one value for the variance, isn't it

I didn't expect the relationship be to linear, either. It could be a non-linear line.

I tend to agree with you. Do you mind giving a specific example that "it has been proven helpful in many cases"? I had this bias that if this assumption doesn't hold, then how do we trust the genes identified as deferentially expression except for some very obvious ones.

ADD REPLY
2
Entering edit mode
6.7 years ago

DESeq and edgeR work with counts rather than the TPMs used in the question. Under the assumption that these counts are negatively bionomially distributed, there should be an effect of expression level on variance.

Looking at your plots above, I'd say 1) you should be plotting counts rather than TPM, 2) Gene with, say, an expression of 30 TPM do seem to cluster around a standard deviation of 15 and genes with an expression of 10TPM more around 5ish.

The mean-variance plot is one of diagnostics that can be produced by both DESeq and edgeR during the course of the analysis. It generally shows a pretty good fit.

Neither DESeq, nor edgeR just replace the standard deviation of a gene with that of genes of a similar expression level, rather they use empirical bayes to shrink the variance towards the mean-variance trend line. Thus a gene can have much higher (or lower) variance than the mean for its expression level, but their most be good evidence for it.

ADD COMMENT
1
Entering edit mode

Thanks. I thought count might be more desirable, too. Do you have any link for some sort of mean-vs-variance plot in count for relatively large sample size?

Shrink the variance is again based on the assumption rather than a justification for using such an assumption in the first place, right?

I am basically looking for evidence that supports the assumption.

ADD REPLY
0
Entering edit mode

I added the same kind of plot, but in read count, do you have any comment, please?

ADD REPLY
0
Entering edit mode

I actually thought your question would have been answered within the first 7 minutes. The assumption AND justification are based on mathematical theory. You have to remember that data, such as non-negative values that have a certain distribution like count data having a negative binomial distribution, has properities that correspond to mathematical functions that have similar properties and so can be applied to analyze the data to get meaningful results. Also, as you may already know you could instead apply OLS with Log transformation to count data... because the justification for doing so is written in its theory.

ADD REPLY
1
Entering edit mode

@theobroma22, I don't think you are answering my question. I am not asking about negative binomial distribution at all. FYI, count data could have a Poisson distribution, too. Also, could you pinpoint what mathematical theory you are referring to? I recall modeling count data as negative binomial distribution is more empirical than theoretical.

Maybe it helps to think of my question another way, What if there is just one gene, how are you going to estimate its variance with negative binomial distribution?

ADD REPLY
0
Entering edit mode

The standard empirical bayes methodology would be to assume that all genes have the same variance unless there was evidence to the contrary and use the all replicates of all genes to obtain a best estimate of this prior equal variance and then apply evidence from each sample to shift each gene away from this average (see David Robinsons very good introduction to empircal bayes. )

However, examination of the data (as in your plots above) suggest that the assumption that all genes have the same variance is clearly not true - more highly expressed genes clearly have a higher variance. On the basis of the relationship shown in your plots we can adjust our prior for the variance based on the expression level - more highly expressed genes have a higher variance. Think of this as putting a trend line though your plots avbove: in the absence of other evidence, this trend line represents your best guess at the variance of a gene based only on its expression. Whether we create this prior empirically of parametrically depends on the package.

Now, with our trended prior in hand we can apply the evidence from each individual gene to move it away from this average.

On the Negative Binomial distribution, there are good theoretical reasons for its use:

Variance in in sequencing experiments comes from three separate places:

  • Sampling variance due to repeated draws from the same population. Theory tells us that this should be binomial (several million trials where each trial has a success rate equal to fraction of transcripts in the cell a gene represents), and that the numbers are large enough that in practice poison should be a good fit. This is indeed the case: if you sequence the same library twice, you get a very good fit to the poissons.

  • Technical variance due to differences in the library preparations, cluster generation, machine function etc. We don't have a good handle on the distribution of this, but tests actually suggest that this source of variance might be minimal.

  • Biological variation: The two samples really don't have the same amount of transcript in them and this transcript varies between samples with some variance. We often think of the actaul expression rate of a gene as being normally distributed, but this can't actually be true because expression can't be less than 0. In fact the Gamma distribution holds many of the properties we would like for a distribution of actaul expression levels.

So we have two sources of variance: the actual level of expression (which is gamma distributed) and the variation in our measurement of that level (which is poisson). The convolution of a poisson distribution with a gamma distributed rate parameter gives a negative binomial distribution for the final measurement of expression level. Or equivalently the Gamma is the conjugate prior on the poisson likelihood, and gives a negative binomial posterior.

ADD REPLY
0
Entering edit mode

Wow. Thanks a lot for your detailed comment, really appreciate it! I agree with that all. I have no question that negative binomial deals with the overdispersion problem while Poisson assumes the mean and variance are always the same. But still, I am looking for other's opinion on the assumption behind all genes being treated together. You mentioned that

assume that all genes have the same variance

which is clearly not right based on the data. But the assumption I am asking about is a much weaker version of this. It's that only genes with the same expression have the same variance (although I used "similar" in the title), which also is consistent with

more highly expressed genes clearly have a higher variance

If this assumption is true, the above data should be essentially a line because, at a given mean expression level, there is only one value for the variance.

ADD REPLY
1
Entering edit mode

It's that only genes with the same expression have the same variance

You were right the first time with similar, rather than the same. What the empirical bayes says is not that all genes have the same variance, but that in the absence of any data, if you guessed the average variance for genes of that expression level, you'd be right more often than anything else you could guess. We can see this is true from looking at your plot above, if you were to draw a trend line through the plot, most points would fall close to that line. For those that don't, you have two options: 1) The variance for that gene really is very different from the majority of genes 2) The estaimte of variance is poor in this case because of the low number of replicates. You can see the genes from 1 as the cloud of genes above the central density in your plots. The task is to distinguish one from the other: genuinely highly variable genes from genes where the variance estimate has been poor. This is where the empirical bayes shrinkage comes in - it weights the evidence for those to probabilities and prioritises the prior or the measured variance based on the strength of the evidence for each.

The empirical proof for this is that on your plots most points do fall close to the trend line, with just a few significantly above the line.

ADD REPLY
0
Entering edit mode

Thanks.

in the absence of any data, if you guessed the average variance for genes of that expression level, you'd be right more often than anything else you could guess

This is insightful. I need to read more about empirical bayes. It sounded very reasonable to me at first, but now given we have data (e.g. TCGA), and I am curious if this empirical rule really applies to the case of gene expression.

Just to clarify in case I wasn't clear in my question. The data plotted is out of 173 leukemia samples, so it is unlikely to due insufficient sample size (your 2nd option).

The empirical proof for this is that on your plots most points do fall close to the trend line, with just a few significantly above the line.

The next interesting question I may look into is how many genes are indeed close to a "fitted trend line", more quantitatively. Is it really true just a few lie significantly above the line. This is also hinted my reply to @Carlo Yague's comment.

ADD REPLY
1
Entering edit mode

You may wish to look into Schurch et al, 2016 (and related papers) out of Geoff Barton'l lab who did a two condition experiment with 47 replicates in each condition in order to benchmark various assumptions and methods for RNAseq.

ADD REPLY
0
Entering edit mode

This paper looks like just what I need, Thanks a lot!

ADD REPLY
1
Entering edit mode
6.6 years ago

Extremely Interesting Q!

I guess that your plotting looks haywire because you haven’t log-transformed the data. The large dynamic range of count RNA-seq data doesn’t allow proper comparing the dispersion in lowly vs highly expressed genes unless you log-transform it (which is what almost all tool for DiffExp analysis does).

If you see the supplementary Figure 1 (Additional file 1) of the DESeq2 paper, the assumption doesn’t look that bad. In log-transformed scale, the genes with low logCPM show very high, but similar dispersion. So is the case with genes with very high logCPM that they show smaller, but similar dispersion (see the red trend line).

DESeq2  supplementary Figure 1 http://ibb.co/b4XdDb

Similar trends can be found in voom paper (Figure 1 https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29

Voom Figure 1 http://ibb.co/hPnjmw

Also, note that you are plotting different tumor samples, and the biological variability among tumor samples is known to be very high. This and the fact that you are comparing a large number (n=173) of samples would, of course, inflate your variance estimates. To give a perspective, in the voom Figure above, you are comparing with the figure (D), not with Figure A or B, which have comparatively low levels of biological variations.

PS: I tried my best to embed the images, but somehow it is not showing. So, I've provided the direct links too.

ADD COMMENT
1
Entering edit mode

For the image, you need to click the image till you see a url ending with png or similar, e.g. http://image.ibb.co/dc7BYb/DESeq_Fig_S1.png. If you use Chrome, right click the image and then click Copy Image Address, then you will get the url.

Great point, thank you! I will check the log transformation in a bit. What is CPM, btw?

ADD REPLY
0
Entering edit mode

Got the images working, thank you! CPM = counts per million

ADD REPLY
0
Entering edit mode

Quick question, please. How do people justify the adding of one, clearly it creates an artifact. I guess genes with a read size around one are just ignored, based on the d subplot of your attached image?

ADD REPLY
1
Entering edit mode

The choice of pseudo-count is not clear, and there are often threads on Twitter debating the issue. The rlog transformation from DESeq takes a principled approach to finding a suitable psuedo-count from the data.

ADD REPLY
0
Entering edit mode
6.7 years ago
theobroma22 ★ 1.2k

However, I wonder why this is a valid assumption?

It is agreeable that the assumption is valid based on the negative binomial function, which is able to control for or handle the variation of genes that are lowly expressed. In addition, I typically find in the genomes that I work with that gene paralogs or duplicated genes can also vary similarly.

Is it agreeable that the assumption is proper?

Did you fit the data using a negative binomial function or just a generalized linear model? Also, use the Root-Mean-Squared-Deviation and not just the SD squared...then you might agree that those assumptions are correct.

ADD COMMENT
0
Entering edit mode

Thank you for replyI am a bit confused... I haven't done any fitting yet, what's plotted is just data from TCGA. Could you please explain what you mean a bit more?

ADD REPLY
0
Entering edit mode

I’ve never used TGCA, so not sure how the data you used was processed. Now I’m confused about your post...lol. Typically, expression data is processed, normalized and then fit to a statistical model. For more in-depth discussion you can watch videos on YouTube.

ADD REPLY
0
Entering edit mode

Thanks. Well, all what I asking is about the validity of the assumption behind DESeq2. What I've plotted is just the mean and variance of all genes from 173 leukemia samples. Each dot is a gene. A statistical model is used to find the variance, but in this post, I just plotted it since there are enough samples to have a good estimate of the variance.

ADD REPLY
0
Entering edit mode

It then becomes obvious that your exercise is not as fruitful or valid as you initially thought.

ADD REPLY
0
Entering edit mode

Do you mind providing some guidance?

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Thank you for your link.

But I think you misunderstood my question. I am questioning the validity of the assumption that "genes with similar expression level have similar variance", it's NOT about which model to use at all. My statement of the assumption is equivalent to/derived from the third assumption as described in the DESeq paper, https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-10-r106

This assumption is needed because the number of replicates is typically too low to get a precise estimate of the variance for gene i from just the data available for this gene. This assumption allows us to pool the data from genes with similar expression strength for the purpose of variance estimation.

Also, towards the end of the video you sent, it explicitly said negative binomial model should NOT be used for small samples. How does that help justify applying DE analysis to a few samples (e.g. 3), as I mentioned in the question?

ADD REPLY
0
Entering edit mode

I've updated my question, hopefully it's clearer now.

ADD REPLY
0
Entering edit mode
6.7 years ago
theobroma22 ★ 1.2k

I’m trying my best to get to the root of your question. Obviously, for just one gene the negative binomial would not apply, and makes no sense to investigate. You can see from your mean v. variance plots that your data is over-dispersed. For data that has this property, like count data, you need a function which is able to control for this dispersion such that the results are meaningful, i.e. you can detect differential gene expression. To see how this works you can refer back to @i.sudbery’s comment, or if you can read math, all of the theorems are out there, so now you can see how the functions are working. If you can’t read math, you’re barking up the wrong tree. As such, you can apply Poisson but mathematicians and statisticians showed that Poisson didn’t control the variation for lowly expressed genes as well as it did with the negative binomial and OLS with log transformation. There are a ton of papers regarding comparing these models to many different types of count datasets in human, mouse, plants...so, this is what is used, and now there are many packages which use this method. Honestly, who are you to question the basis of these packages when they are created by very smart people - graduates of Harvard, Princeton, University of Queensland, Standford, MIT, Oxford...to me, you are still learning and it’s good to question this, but your questions reflect your lack of knowledge and resource in this area!

ADD COMMENT

Login before adding your answer.

Traffic: 2034 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6