Question

output TMM normalized counts with edgeR

10

Entering edit mode

5.9 years ago

guillaume.rbt ★ 1.0k

Hi all,

Sorry I know that this question has been asked several times, but unfortunately I haven't been able to find the right answer, or didn't understand.

I'm trying to get TMM normalized counts thanks to edgeR.

I understand that I have to compute normalization factors :

dgList <- calcNormFactors(dgList, method="TMM")

which gives me a normalization factor for all samples :

head(dgList$samples)

group lib.size norm.factors
S1     1 21087314    0.9654794
S2     1 16542810    1.1589117
S3     1 18875473    0.8763291
S4     1 15865414    1.0864038
S5     1 19179795    1.0488230
S6     1 15063992    1.0707007

But at this step I don't know what to do to get a matrix of normalized TMM counts.

I know that I can get CPM normalized counts thanks to :

cpm(dgList)

But CPM and TMM are not the same, right ?

Thanks in advance for any of your input on this topic.

differential expression RNA-Seq • 37k views

ADD COMMENT • link updated 23 months ago by N. • 0 • written 5.9 years ago by guillaume.rbt ★ 1.0k

2

Entering edit mode

It depends on the step in which you are analyzing your data. In this case, you can retrieve the TMM by running the next sequence of commands:

dgeList <- calcNormFactors(dgList, method="TMM") #Normalize your raw data using the TMM method
cpm(dgeList) #Obtain the TMM values of your data

The cpm function is aware whether data have been normalized using TMM. The answer from James Ashmore, in this post, contains a snippet of the source code of cpm function.

ADD REPLY • link 2.9 years ago by rodolfo.peacewalker ▴ 390

0

Entering edit mode

No I think pesudo.counts is the TMM normalized read counts, if you check with cpm function in edgeR, using pseudo.counts you will get the very similar cpm value by formula. Otherwise, you could use CPM formula to infer the TMM normalized read counts.

ADD REPLY • link 2.9 years ago by shengxin1015 ▴ 10

score 18 · Accepted Answer · 2018-05-30

18

Entering edit mode

5.9 years ago

James Ashmore ★ 3.4k

If you run the cpm function on a DGEList object which contains TMM normalisation factors then you will get TMM normalised counts. Here is a snippet of the source code for the cpm function:

cpm.DGEList <- function(y, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25, ...)
#   Counts per million for a DGEList
#   Davis McCarthy and Gordon Smyth.
#   Created 20 June 2011. Last modified 10 July 2017
{
    lib.size <- y$samples$lib.size
    if(normalized.lib.sizes) lib.size <- lib.size*y$samples$norm.factors
    cpm.default(y$counts,lib.size=lib.size,log=log,prior.count=prior.count)
}

The function checks to see if a DGEList object was provided with a lib.size and norm.factors column (created when you run calcNormFactors), if so then it uses those in the normalisation of the raw counts. You were right in your original post, just run the following and you will have TMM normalised counts:

dge <- calcNormFactors(dge, method = "TMM")
tmm <- cpm(dge)

ADD COMMENT • link 5.9 years ago by James Ashmore ★ 3.4k

0

Entering edit mode

Ok great, I did think there was something with the cpm function, but I get it know.

ADD REPLY • link 5.9 years ago by guillaume.rbt ★ 1.0k

0

Entering edit mode

Just to again ensure that people understand:

cpm() produces log2-transformed normalised counts per million (log2 CPM). These are not the TMM-normalised counts. Take a look at lieven.sterck's answer.

In RNA-seq, the data processing steps usually go:

raw counts
normalised counts
transformed, normalised counts (the transformatio varies from program to program)

ADD REPLY • link 4.6 years ago by Kevin Blighe 87k

1

Entering edit mode

This is not accurate, the default returned value is not log2 transformed according to the documentation:

## S3 method for class 'DGEList'
cpm(y, normalized.lib.sizes = TRUE,
       log = FALSE, prior.count = 2, ...)
## Default S3 method:
cpm(y, lib.size = NULL,
       log = FALSE, prior.count = 2, ...)

ADD REPLY • link 4.5 years ago by shengwei ▴ 40

0

Entering edit mode

Nobody said whether or not the log [base 2] ('log2') transformation was the default. However, when log = TRUE, it is log2 that is calculated.

ADD REPLY • link 4.5 years ago by Kevin Blighe 87k

0

Entering edit mode

It is a great informative post. Thanks. I think I'm a bit confused with normalization here though.

When someone says or shares normalized data, do they share cpm() value or,
They scale their gene counts by norm.factors to generate normalized gene counts?

ADD REPLY • link 3.9 years ago by Bioinformatics_NewComer ▴ 330

0

Entering edit mode

It depends on what is the end-goal of the analysis (?). The Normalised counts will follow a negative binomial distribution; so, they would not be suitable for the vast majority of downstream analyses. On the other hand log2(cpm + 1) would follow a normal distribution, and be suitable for most downstream analyses.

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

Entering edit mode

The end goal is simple differential gene expression analysis using case-control (human tissue).

ADD REPLY • link 3.9 years ago by Bioinformatics_NewComer ▴ 330

0

Entering edit mode

In that case, and assuming that you are a beginner in this area, I suggest that you follow the helpful vignettes for EdgeR or DESeq. If you need more specific help, then please open up a new question and be sure to describe the data that you currently have.

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

Entering edit mode

I've code and analysis pipeline developed on my end employing edgeR, limma, voom and other utilities. I'm reading and exploring when to use what and why, thus I thought to nudge this post because I'm confused and cannot understand further. Anyway, thanks~

ADD REPLY • link 3.9 years ago by Bioinformatics_NewComer ▴ 330

2

Entering edit mode

Oh, yes, it is certainly difficult when starting in this field. If you want a very simple differential expression use-case with DESeq2, then take a look at Sections 1-3 of my own vignette: EnhancedVolcano.

The general pipeline for any bulk RNA-seq experiment is:

obtain raw counts (measured on negative binomial)
normalise raw counts (measured on negative binomial; normalised via DESeq2 or EdgeR)
perform differential expression analysis (tests are performed on the normalised counts, and the model that we use assumes that the data follows a negative binomial distribution. So, a test like linear regression or Student's t-test cannot be used here, as they assume a normal / binomial distribution)
transform the normalised counts for downstream tasks like clustering, PCA, and 'machine learning' (algorithms used include variance-stabilistion, regularised log, log2(CPM+1), etc)

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

0

Entering edit mode

hi, what is the impact on defining normalized.lib.sizes = TRUE or normalized.lib.sizes = FALSE, when getting transformed, normalised counts.

ADD REPLY • link 23 months ago by N. • 0

score 5 · Accepted Answer · 2018-05-30

5

Entering edit mode

5.9 years ago

lieven.sterck 15k

No, CPM and TMM are not exactly the same indeed.

perhaps try this snippet of code:

dgList <- estimateCommonDisp(dgList)
dgList <- estimateTagwiseDisp(dgList)
norm_counts.table <- t(t(dgList$pseudo.counts)*(dgList$samples$norm.factors))
write.table(norm_counts.table, file="./normalizedCounts.txt", sep="\t", quote=F)

ADD COMMENT • link 5.9 years ago by lieven.sterck 15k

3

Entering edit mode

cpm() uses TMM normalization factors automatically.

The edgeR documentation advises users not to use pseudo.counts but instead to use cpm or rpkm to export normalized values out of edgeR. The pseudo.counts are only for certain internal edgeR calculations and are not computed at all for most edgeR pipelines.

Even if one did want to export pseudo.counts, it would be incorrect to multiply them by norm.factors as your code does because the norm.factors were already incorporated into the pseudo.counts when they were computed.

ADD REPLY • link 2.9 years ago by Gordon Smyth ★ 7.0k

0

Entering edit mode

Thank for your help.

Could you explain me what the "pseudo.counts" are?

ADD REPLY • link 5.9 years ago by guillaume.rbt ★ 1.0k

0

Entering edit mode

There's no need to calculate the TMM values yourself, the cpm function should do it for you given a DGEList with the lib.size and norm.factors columns present (which you get after running calcNormFactors).

ADD REPLY • link 5.9 years ago by James Ashmore ★ 3.4k

0

Entering edit mode

ADD REPLY • link 5.0 years ago by iraia.munoa ▴ 130

0

Entering edit mode

Hi, I am wondering why the estimateTagwiseDisp() is needed? Does it change anything for the output?

ADD REPLY • link 4.3 years ago by darklings ▴ 570