Question

export edgeR normalised OTU table

4.5 years ago

fyfes ▴ 70

Hello, I want to normalise my microbiome data (OTU table) using the edgeR package and export the normalised matrix (I want to test some normalisation methods; the normalised data will be used to calculate beta diversity and visualise it by PCoA). I went through the Bioconductor forum and Biostars and I still don't know what the proper way is to export normalised data from edgeR.

So, I read in my matrix and calculate the normalisation factors:

library(edgeR)
d <- DGEList(counts = d, group = group)  # d: OTU count matrix
d <- calcNormFactors(d)                  # TMM normalisation factors by default

What happens next? 1st option:

d = estimateCommonDisp(d, verbose=TRUE)
normalised_df <- d$pseudo.alt

2nd option:

normalised_df <- cpm(d, normalized.lib.sizes = TRUE)

Thank you!

microbiome normalisation • 2.3k views

updated 4.5 years ago by ATpoint 88k • written 4.5 years ago by fyfes ▴ 70

If your aim is to obtain TMM normalized counts, this Biostars post might help you: A: output TMM normalized counts with edgeR

4.5 years ago by antonioggsousa 3.4k

Thank you. So it normalises my data without using cpm?

4.5 years ago by fyfes ▴ 70

Yes, it tries to obtain TMM-normalized counts (you cannot obtain them natively from edgeR). If that is not your aim, though, I would choose the method described by @ATpoint below (the normalization recommended by the edgeR authors).

4.5 years ago by antonioggsousa 3.4k

score 2 · Answer 1 · 2021-01-29

The simplest solution is in fact to use your first two lines of code and then run edgeR::cpm(d, log=FALSE) to get normalized counts; see ?cpm for details. If the normalization factors have been calculated beforehand, the cpm function will use them; if not, it only corrects for library size differences. I agree that this can all be confusing, because there are in fact multiple options for calculating CPMs in edgeR.
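As a minimal sketch of that suggestion: the toy OTU matrix, the group labels and the output file name below are made up for illustration and are not taken from the question. The manual calculation is only there to show that, once normalization factors are present, cpm divides each count by the effective library size (lib.size times norm.factors):

```r
library(edgeR)

# Hypothetical toy OTU table: 50 OTUs (rows) x 6 samples (columns)
set.seed(1)
otu <- matrix(rnbinom(300, mu = 50, size = 2), nrow = 50,
              dimnames = list(paste0("OTU", 1:50), paste0("S", 1:6)))
group <- factor(rep(c("A", "B"), each = 3))

d <- DGEList(counts = otu, group = group)
d <- calcNormFactors(d)              # TMM by default

# Normalized CPMs; the norm factors stored in d$samples are used automatically
norm_cpm <- cpm(d, log = FALSE)

# The same values by hand, using the effective library sizes
eff_lib <- d$samples$lib.size * d$samples$norm.factors
manual  <- t(t(otu) / eff_lib) * 1e6
all.equal(norm_cpm, manual, check.attributes = FALSE)

# Export as a tab-separated table (hypothetical file name, written to tempdir())
out_file <- file.path(tempdir(), "otu_tmm_cpm.tsv")
write.table(norm_cpm, file = out_file, sep = "\t", quote = FALSE, col.names = NA)
```

With col.names = NA the header row gets an empty first field, so the OTU names line up as a proper first column when the file is read back in.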

The above suggestion is probably what most people find useful, as it gives a set of normalized counts that are simple to calculate and do not depend on the experimental design. You could then put these normalized data on the log scale, e.g. via log2(cpms+1), or alternatively use log=TRUE in the cpm function. The latter gives different results from putting the CPMs on the log scale manually (it adds a prior count, see the prior.count argument in ?cpm) and can also produce values smaller than zero (as the log of values smaller than 1 is negative), which I usually find undesirable, e.g. for plotting purposes, when you intuitively expect the smallest possible value to be zero. The authors surely have sound reasons for implementing it the way they did, but in the end you have to see which strategy is usable for your analysis. I always use the log2(cpm+1) approach.
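To see the difference between the two log transformations on your own data, something along these lines could be used. This is only a sketch on a made-up toy matrix; prior.count = 2 is simply the default documented in ?cpm and is written out here for clarity:

```r
library(edgeR)

# Hypothetical toy OTU table: 50 OTUs (rows) x 6 samples (columns)
set.seed(1)
otu <- matrix(rnbinom(300, mu = 50, size = 2), nrow = 50,
              dimnames = list(paste0("OTU", 1:50), paste0("S", 1:6)))
d <- calcNormFactors(DGEList(counts = otu))

# Manual approach: CPM + 1 is at least 1, so the result is never below zero
log_manual <- log2(cpm(d, log = FALSE) + 1)

# Built-in approach: a prior count is added before taking logs
log_edger <- cpm(d, log = TRUE, prior.count = 2)

# Compare the ranges and the element-wise differences
range(log_manual)
range(log_edger)
summary(as.vector(log_manual - log_edger))
```

Whether the built-in version actually dips below zero depends on the library sizes: it can only happen when a CPM (after the prior count is added) falls below 1, which needs fairly deep libraries.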

If there are doubts that neither the manual nor the help pages of the functions can resolve, you can open a question at support.bioconductor.org. The authors are outstandingly responsive, but please be sure to search first to make sure the question has not been asked (many times) before.

Based on the documentation in ?cpm, you can also calculate CPMs from DGEGLM or DGELRT objects (after running glmFit or glmQLFit) rather than from the DGEList you have above. I cannot tell you in greater detail when exactly this strategy would be desirable, and there does not seem to be documentation beyond what is in the details section of the function's help page; at least I did not find any in the manual.

I would therefore go for the suggestion in the first paragraph. I would not use the linked solution, as it performs non-standard calculations on pseudo-counts, which the user guide explicitly discourages in section 2.8.7:

The pseudo-counts are computed for a specific purpose, and their computation depends on the experimental design as well as the library sizes, so users are advised not to interpret the psuedo-counts as general-purpose normalized counts. They are intended mainly for internal use in the edgeR pipeline.