output TMM normalized counts with edgeR
2
9
Entering edit mode
3.2 years ago
guillaume.rbt ▴ 900

Hi all,

Sorry I know that this question has been asked several times, but unfortunately I haven't been able to find the right answer, or didn't understand.

I'm trying to get TMM normalized counts thanks to edgeR.

I understand that I have to compute normalization factors :

dgList <- calcNormFactors(dgList, method="TMM")


which gives me a normalization factor for all samples :

head(dgList$samples) group lib.size norm.factors S1 1 21087314 0.9654794 S2 1 16542810 1.1589117 S3 1 18875473 0.8763291 S4 1 15865414 1.0864038 S5 1 19179795 1.0488230 S6 1 15063992 1.0707007  But at this step I don't know what to do to get a matrix of normalized TMM counts. I know that I can get CPM normalized counts thanks to : cpm(dgList)  But CPM and TMM are not the same, right ? Thanks in advance for any of your input on this topic. differential expression RNA-Seq • 17k views ADD COMMENT 1 Entering edit mode It depends on the step in which you are analyzing your data. In this case, you can retrieve the TMM by running the next sequence of commands: dgeList <- calcNormFactors(dgList, method="TMM") #Normalize your raw data using the TMM method cpm(dgeList) #Obtain the TMM values of your data  The cpm function is aware whether data have been normalized using TMM. The answer from James Ashmore, in this post, contains a snippet of the source code of cpm function. ADD REPLY 0 Entering edit mode No I think pesudo.counts is the TMM normalized read counts, if you check with cpm function in edgeR, using pseudo.counts you will get the very similar cpm value by formula. Otherwise, you could use CPM formula to infer the TMM normalized read counts. ADD REPLY 14 Entering edit mode 3.2 years ago James Ashmore ★ 3.1k If you run the cpm function on a DGEList object which contains TMM normalisation factors then you will get TMM normalised counts. Here is a snippet of the source code for the cpm function: cpm.DGEList <- function(y, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25, ...) # Counts per million for a DGEList # Davis McCarthy and Gordon Smyth. # Created 20 June 2011. Last modified 10 July 2017 { lib.size <- y$samples$lib.size if(normalized.lib.sizes) lib.size <- lib.size*y$samples$norm.factors cpm.default(y$counts,lib.size=lib.size,log=log,prior.count=prior.count)
}


The function checks to see if a DGEList object was provided with a lib.size and norm.factors column (created when you run calcNormFactors), if so then it uses those in the normalisation of the raw counts. You were right in your original post, just run the following and you will have TMM normalised counts:

dge <- calcNormFactors(dge, method = "TMM")
tmm <- cpm(dge)

0
Entering edit mode

Ok great, I did think there was something with the cpm function, but I get it know.

0
Entering edit mode

Just to again ensure that people understand:

cpm() produces log2-transformed normalised counts per million (log2 CPM). These are not the TMM-normalised counts. Take a look at lieven.sterck's answer.

In RNA-seq, the data processing steps usually go:

1. raw counts
2. normalised counts
3. transformed, normalised counts (the transformatio varies from program to program)
1
Entering edit mode

This is not accurate, the default returned value is not log2 transformed according to the documentation:

## S3 method for class 'DGEList'
cpm(y, normalized.lib.sizes = TRUE,
log = FALSE, prior.count = 2, ...)
## Default S3 method:
cpm(y, lib.size = NULL,
log = FALSE, prior.count = 2, ...)

0
Entering edit mode

Nobody said whether or not the log [base 2] ('log2') transformation was the default. However, when log = TRUE, it is log2 that is calculated.

0
Entering edit mode

It is a great informative post. Thanks. I think I'm a bit confused with normalization here though.

When someone says or shares normalized data, do they share cpm() value or,
They scale their gene counts by norm.factors to generate normalized gene counts?

0
Entering edit mode

It depends on what is the end-goal of the analysis (?). The Normalised counts will follow a negative binomial distribution; so, they would not be suitable for the vast majority of downstream analyses. On the other hand log2(cpm + 1) would follow a normal distribution, and be suitable for most downstream analyses.

0
Entering edit mode

The end goal is simple differential gene expression analysis using case-control (human tissue).

0
Entering edit mode

In that case, and assuming that you are a beginner in this area, I suggest that you follow the helpful vignettes for EdgeR or DESeq. If you need more specific help, then please open up a new question and be sure to describe the data that you currently have.

0
Entering edit mode

I've code and analysis pipeline developed on my end employing edgeR, limma, voom and other utilities. I'm reading and exploring when to use what and why, thus I thought to nudge this post because I'm confused and cannot understand further. Anyway, thanks~

1
Entering edit mode

Oh, yes, it is certainly difficult when starting in this field. If you want a very simple differential expression use-case with DESeq2, then take a look at Sections 1-3 of my own vignette: EnhancedVolcano.

The general pipeline for any bulk RNA-seq experiment is:

1. obtain raw counts (measured on negative binomial)
2. normalise raw counts (measured on negative binomial; normalised via DESeq2 or EdgeR)
3. perform differential expression analysis (tests are performed on the normalised counts, and the model that we use assumes that the data follows a negative binomial distribution. So, a test like linear regression or Student's t-test cannot be used here, as they assume a normal / binomial distribution)
4. transform the normalised counts for downstream tasks like clustering, PCA, and 'machine learning' (algorithms used include variance-stabilistion, regularised log, log2(CPM+1), etc)
4
Entering edit mode
3.2 years ago

No, CPM and TMM are not exactly the same indeed.

perhaps try this snippet of code:

dgList <- estimateCommonDisp(dgList)
dgList <- estimateTagwiseDisp(dgList)
norm_counts.table <- t(t(dgList$pseudo.counts)*(dgList$samples\$norm.factors))
write.table(norm_counts.table, file="./normalizedCounts.txt", sep="\t", quote=F)

1
Entering edit mode

cpm() uses TMM normalization factors automatically.

The edgeR documentation advises users not to use pseudo.counts but instead to use cpm or rpkm to export normalized values out of edgeR. The pseudo.counts are only for certain internal edgeR calculations and are not computed at all for most edgeR pipelines.

Even if one did want to export pseudo.counts, it would be incorrect to multiply them by norm.factors as your code does because the norm.factors were already incorporated into the pseudo.counts when they were computed.

0
Entering edit mode

Could you explain me what the "pseudo.counts" are?

0
Entering edit mode

There's no need to calculate the TMM values yourself, the cpm function should do it for you given a DGEList with the lib.size and norm.factors columns present (which you get after running calcNormFactors).

0
Entering edit mode

0
Entering edit mode

Hi, I am wondering why the estimateTagwiseDisp() is needed? Does it change anything for the output?