21 months ago
antmantras ▴ 80

Hello everybody.

I want to check whether I correctly understand the general statistical procedure that edgeR uses to identify genes with significant differential expression between several conditions. I have read both the manual and some of the group's publications such as link. However, I am not sure that I have understood the whole process. I apologize in advance if some of my questions or assumptions are too obvious or incorrect, as I am still trying to understand generalized linear models too.

Leaving out certain parts such as removing genes that do not have a minimum expression level and normalizing library sizes, the first thing is to estimate the dispersion of the genes, which is the sum of technical and biological variation. We can assume that all genes have a common dispersion (the average dispersion of all genes), that the dispersion of each gene is different (tagwise), or calculate a dispersion based on the average of the dispersions of genes with a similar level of counts (trended).

Once the dispersion has been estimated, the next step is to fit a GLM with a log link for each gene, i.e. the log-linear model $\log \mu_{gi} = x_i^T \beta_g + \log N_i$, where $\mu_{gi}$ is the expected count for gene g in sample i, $x_i$ is the row of the design matrix for sample i, $\beta_g$ is the vector of coefficients for gene g and $N_i$ is the library size. The aim is to find the parameters of the negative binomial distribution from which the observed counts are most likely to come (a maximum likelihood method is used for this). When the model has been fitted for a gene, we have an estimate of the mean number of reads that should map onto it for each of the conditions considered which, together with the dispersion parameter calculated above, allows us to calculate the variance of the gene counts for each experimental group. Finally, with an F-test, we can check whether there are significant differences between the variances/levels of gene expression.
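To make the model concrete, here is a minimal Python sketch of how fitted means follow from the log-linear model, reading it as log µ_gi = x_i^T β_g + log N_i, so that µ_gi = N_i * exp(x_i^T β_g). This is not edgeR code: the function name `fitted_means` and all the numbers (design, coefficients, library sizes) are hypothetical and chosen only for illustration.

```python
import math

def fitted_means(design, beta, lib_sizes):
    """Return mu_i = N_i * exp(x_i . beta) for each sample i of one gene."""
    return [
        n * math.exp(sum(x * b for x, b in zip(row, beta)))
        for row, n in zip(design, lib_sizes)
    ]

# Hypothetical design: intercept plus a group effect, two samples per group.
design = [[1, 0], [1, 0], [1, 1], [1, 1]]
# Hypothetical coefficients on the natural-log scale:
# baseline proportion of reads 1e-5, and a 2-fold increase in the second group.
beta = [math.log(1e-5), math.log(2.0)]
# Hypothetical library sizes N_i.
lib_sizes = [1_000_000, 2_000_000, 1_000_000, 2_000_000]

mu = fitted_means(design, beta, lib_sizes)
print(mu)  # approximately 10, 20, 20, 40
```

The library size enters only as an offset, so two samples in the same group can have different expected counts while sharing the same coefficients.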

Is this correct? Thanks in advance.

edger dge glm rna-seq • 1.6k views
21 months ago
Gordon Smyth ★ 7.3k

"dispersion of the genes, which is the sum of technical and biological variation"

The article that you link to explains how technical and biological variation can be separated. The squared coefficient of variation is the sum of technical and biological variation, but the negative binomial dispersion measures only the biological component. That is one of the key points of the published paper. Otherwise your description of how edgeR works is broadly correct.
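As a small worked illustration of that separation, the sketch below assumes the usual negative binomial variance function, var = µ + φµ², and Poisson technical noise. Dividing by µ² gives CV² = 1/µ + φ, where 1/µ is the technical part and the dispersion φ is the biological part. The function names and numbers are hypothetical, and this is plain Python rather than edgeR code.

```python
import math

def nb_variance(mu, phi):
    """Negative binomial variance for mean mu and dispersion phi."""
    return mu + phi * mu ** 2

def cv2_components(mu, phi):
    """Split the squared coefficient of variation into its two parts."""
    technical = 1.0 / mu   # Poisson (technical) contribution, shrinks with depth
    biological = phi       # NB dispersion = squared biological CV
    return technical, biological, technical + biological

# Hypothetical gene: expected count 100, dispersion 0.04 (biological CV of 0.2).
mu, phi = 100.0, 0.04
technical, biological, total = cv2_components(mu, phi)
variance = nb_variance(mu, phi)

print(technical, biological, total)  # about 0.01, 0.04, 0.05
print(variance / mu ** 2)            # total CV^2 again, about 0.05
print(math.sqrt(biological))         # biological CV, about 0.2
```

As the expected count grows, the technical term 1/µ vanishes and the total CV² approaches the dispersion alone.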

The statistical theory behind edgeR is explained in the various journal articles. edgeR allows exact tests, likelihood ratio tests or quasi F-tests. The original edgeR approach in 2007 used exact tests. The glm likelihood ratio tests, introduced in 2011 and explained in the paper that you link to, allow for completely general experimental designs. Quasi F-tests, introduced in 2012, relax the distributional assumptions even more and achieve stricter FDR control compared to the older pipelines.

There are also two types of dispersion. Exact tests and likelihood ratio tests use the negative binomial dispersion. The quasi-likelihood approach introduces the possibility of a quasi-dispersion as well, which allows edgeR to model more general types of technical variation and to account for uncertainty in dispersion estimation. Unlike the negative binomial dispersion, the quasi-dispersion does reflect both technical and biological contributions.
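To show how the two dispersions combine, here is a minimal sketch assuming the quasi-likelihood variance function var = σ²(µ + φµ²), where φ is the negative binomial dispersion and σ² is the gene-specific quasi-dispersion. The function name and numbers are hypothetical, and this is plain Python rather than edgeR code.

```python
def ql_variance(mu, nb_dispersion, quasi_dispersion):
    """Quasi-likelihood variance: the NB variance scaled by the quasi-dispersion."""
    return quasi_dispersion * (mu + nb_dispersion * mu ** 2)

# Hypothetical gene with expected count 100 and NB dispersion 0.04.
mu, phi = 100.0, 0.04

# A quasi-dispersion of 1 reduces to the plain negative binomial variance.
var_nb = ql_variance(mu, phi, 1.0)
# A quasi-dispersion above 1 inflates the whole variance, Poisson part included.
var_ql = ql_variance(mu, phi, 1.5)

print(var_nb, var_ql)  # about 500 and 750
```

Because σ² multiplies the entire variance, it scales the Poisson-like term as well as the φµ² term, which is why it is not a purely biological quantity.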


Hi Gordon, thanks for your response. It seems that I ended up mixing up the concepts of dispersion and CV.

If I understand correctly, the dispersion estimated in the first part of the analysis is given by biological variation only. Then we can use it together with the estimated number of reads that should map onto the gene under the conditions specified to calculate the variances for the F-test.


Thanks Gordon