Question

A question regarding log2 transformation in EdgeR

2

Entering edit mode

3.2 years ago

RNAseqer ▴ 260

Hello everyone, I have a question as to EdgeR's differential expression analysis and the use of log2 transformation as part of the normalization process. I want to test for DE using TMM normalized log2 transformed counts data (log2 transformation serving as a means to reduce skew and the effects of outliers). I know it is possible to output normalized, log2 transformed values by using cpm() on the output of EdgeR's cpm() function:

#Use counts and phenotype data to create a dgelist object:

group <- phenotype_data_matrix$test_group
df.counts <- as.data.frame(counts_matrix)
dgelist <- DGEList(counts=df.counts, group=group)

# conduct normalization. The normalization factors will be stored in the object with an unaltered copy of the counts matrix:

calc.normfact.out <- calcNormFactors(dgelist, method = "TMM")

#extract normalized, log2 transformed  expression data using cpm() note: a default prior count of .25, scaled to effective library size will be added by EdgeR to prevent taking log of 0 counts:

cpm.tmm.log2 <- cpm(calc.normfact.out, log=TRUE, normalized.lib.sizes = TRUE)

My concern is this: does EdgeR use log2 transformation as part of the normalization prior to estimating differential expression? If it is not a default, is it possible to secify this as part of the normalization process, or would such a transformation of counts into log2 transformed values actually violate the assumptions EdgeR makes about the RNA-seq data's distribution? Below is a copy of the code I'm using to test for DE between group A and group B,

#additional covariates you want in your glm:

batch <- phenotype_data_matrix$Batch
sex <- phenotype_data_matrix$sex

#set up model with additional relevant covariates:

design <- model.matrix(~group + batch + sex)

# carry out DE analysis to identify DE genes between the two categories in 'group' (this is specified by 'coef=2') adjusting for Batch and Sex using glm():

dge.estDisp.out <- estimateDisp(calc.normfact.out, design)
fit <- glmFit(dge.estDisp.out, design)
lrt <- glmLRT(fit, coef = 2)

I have included my code here so that A )in case there is there is a simple way to modify it to use log2 transformed-normalized values provided that is not the default and B) Allow you to point out any gross mistakes or misconceptions on my part. Many thanks for any help you can provide!

EdgeR log2 differentail expression analysis • 4.7k views

ADD COMMENT • link updated 3.2 years ago by Gordon Smyth ★ 7.0k • written 3.2 years ago by RNAseqer ▴ 260

score 6 · Accepted Answer · 2021-02-18

6

Entering edit mode

3.2 years ago

Gordon Smyth ★ 7.0k

edgeR automatically uses the TMM normalization factors. You don't have to change anything.

However you should understand that TMM normalizes the library sizes, not the counts. So there is really no such thing as TMM normalized counts. edgeR does not use log2 tranformed counts at any stage, either for normalization or for differential expression. See the edgeR User's section on normalization.

The values output by cpm() are for graphical purposes and for export to other software, but you cannot reproduce edgeR's analyses using them. On the other hand, you can input the log2CPM values from edgeR into limma for an alternative differential expression analysis based on normality instead of the negative binomial distribution. That's called the limma-trend analysis pipeline. There's really no need to do that unless you want to use extra features of limma like intrablock correlations or empirical quality weights. For a regular differential expression analysis, the edgeR quasi-likelihood pipeline you are using already is ideal.

ADD COMMENT • link 3.2 years ago by Gordon Smyth ★ 7.0k

0

Entering edit mode

Ok, so if it adjusts library size (creating an effective library size) and then you extract values using cpm() aren't the measurements you are getting for each gene a TMM normalized (and optionally log2 transformed) measurement of that particular genes expression? If you extract in this way what you are using for visualization is normalized gene-expression data correct? Even if it is not 'count' data any more. Am I correct in that?

Additionally, is there a way to use normalized log2 transformed data for DE analysis in the DEseq package?

ADD REPLY • link 3.2 years ago by RNAseqer ▴ 260

1

Entering edit mode

You can use TMM to normalize the library sizes. And then the normalized library sizes can be used to compute logCPM values. You can call the results TMM normalized counts if you like, but I do not do so because I think that terminology is misleading. In fact it causes the sort of misunderstandings that have lead to your question here.

No you cannot use log2 transformed counts in DESeq. DESeq is a negative binomial package, just like edgeR. log transformed counts are not negative binomial.

I am not clear what is worrying you. As far as I can see, the DE analysis you've done is perfectly ok.

ADD REPLY • link 3.2 years ago by Gordon Smyth ★ 7.0k

0

Entering edit mode

OK Thank you, I was just trying to clarify things to make sure I fully understood. Thank you for answer my questions!

ADD REPLY • link 3.2 years ago by RNAseqer ▴ 260