Question

edgeR analysis with CPM-normalized counts

0

3.6 years ago

silas008 ▴ 160

Hey, guys.

I am used to analysing data from raw reads. Now I have a table from GEO Datasets containing CPM-normalized counts produced by edgeR. Can I proceed normally from it, without calcNormFactors()?

Like

> x <- read.delim("Table.csv",row.names="Gene")
> group <- factor(c(1,1,1,2,2,2))
> y <- DGEList(counts=x,group=group)
> design <- model.matrix(~group)
> y <- estimateDisp(y,design)
> et <- exactTest(y)

Thank you very much for any help.

RNA-Seq • 4.2k views

ADD COMMENT • link updated 3.6 years ago by ATpoint 81k • written 3.6 years ago by silas008 ▴ 160

1

You need raw counts for edgeR and DESeq2, primarily because they normalize for library size. Your best bet would be to reanalyze the data if they don't have raw counts.

ADD REPLY • link 3.6 years ago by rpolicastro 13k

0

Thank you for your answer.

Actually, the data are already normalized by edgeR, so they are normalized by library size. The only difference is that they provided the CPM table, not the raw table.

Even so, should I reanalyze the data?

Thanks again

ADD REPLY • link 3.6 years ago by silas008 ▴ 160

0

Hi,

If you have CPM-normalized counts from edgeR, then they should already be normalized for library size. You do not need to run calcNormFactors() again; see https://reneshbedre.github.io/blog/expression_units.html#tmm-trimmed-mean-of-m-values

ADD REPLY • link 3.6 years ago by Renesh ★ 2.2k

0

Yes, you should reanalyze the data. CPM is considered a simple summary statistic, but calcNormFactors() does much more. This is what cpm() does:

CPM or RPKM values are useful descriptive measures for the expression level of a gene.

Compare that to what calcNormFactors() does:

The calcNormFactors function normalizes the library sizes by finding a set of scaling factors for the library sizes that minimizes the log-fold changes between the samples for most genes. The default method for computing these scale factors uses a trimmed mean of M-values (TMM) between each pair of samples

Indeed, the TMM output of calcNormFactors() is a kind of between-sample normalization, which is very important for differential expression analysis, while CPM provides a within-sample normalization.

ADD REPLY • link 3.6 years ago by Hamid Ghaedi 3.2k

0

The data was normalized using calcNormFactors(). That is, the CPM is based on normalized values. The only difference is that I do not have the raw data to perform the whole process, only the CPM table.

In short, what I want to know is how to use the CPM table in edgeR. Is it possible? Can I simply use the basic code that I wrote above?

Thank you again

ADD REPLY • link 3.6 years ago by silas008 ▴ 160

score 4 · Answer 1 · 2020-09-28

Some of the above comments are misleading because OP is asking whether they can use the normalized counts for differential expression, not how to obtain CPM. The answer is no, because edgeR models the raw counts and uses normalization factors as offsets in its GLM. It does not use any normalized counts directly (the same goes for DESeq2). Therefore, although it is technically possible, you should not feed anything but raw counts into edgeR. If you deviate from the recommended standard workflow you might get suboptimal or flawed results, so it is strongly recommended not to do that.

If the data are from GEO, it is possible to get raw counts. Download the fastq files, e.g. via links provided by sra-explorer, then use a lightweight quantifier such as salmon or kallisto, and then aggregate these transcript-level counts to the gene level with tximport. That is admittedly a bit of work, but you can be sure that you follow the tool authors' recommendation to use raw counts exclusively, and not any custom approaches.
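To make the point about raw counts concrete, here is a minimal base-R sketch (it does not call edgeR; the counts, library sizes and normalization factors are made-up values for illustration only). It shows that two samples can have the same CPM while the underlying counts differ tenfold, and how an offset of the kind mentioned above would be formed from library sizes and normalization factors.

```r
# Hypothetical raw counts for one gene in two libraries of different depth
raw      <- c(sampleA = 10,  sampleB = 100)
lib_size <- c(sampleA = 1e6, sampleB = 1e7)

# CPM rescales each count by its library size
cpm <- raw * 1e6 / lib_size
print(cpm)  # the same CPM for both samples

# Under Poisson sampling the relative uncertainty of a count is about
# 1/sqrt(count), so the two samples are not equally informative
rel_sd <- 1 / sqrt(raw)
print(rel_sd)

# Made-up normalization factors (assumption: in a real analysis these
# would come from calcNormFactors())
norm_factors <- c(sampleA = 0.95, sampleB = 1.05)

# Offset of the kind used in the GLM: log of the effective library size
glm_offset <- log(lib_size * norm_factors)
print(glm_offset)
```

With only the CPM table in hand, the counts and library sizes cannot be recovered from the CPM values alone, so the information in `raw` and `lib_size` that a count model relies on is no longer available.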

Also note that the exact test is not the current recommendation of the edgeR authors. The preferred pipeline (to the best of my knowledge) is the quasi-likelihood F-test (QLF) framework, as outlined in the edgeR manual. For some inspiration you can check point three in "Basic normalization, batch correction and visualization of RNA-seq data", but be sure to check the manual, as this is the official reference. See also "QLF-test vs exact test".
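As a rough sketch of what a QLF-based analysis could look like once raw counts are available: the example below uses simulated counts (hypothetical data, not the GEO dataset from the question) with the same two-group layout as in the question. It assumes the edgeR package is installed, and the edgeR manual remains the reference for the details.

```r
library(edgeR)

set.seed(1)
# Simulated raw counts (hypothetical): 1000 genes x 6 samples
counts <- matrix(rnbinom(1000 * 6, mu = 50, size = 10), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))
group <- factor(c(1, 1, 1, 2, 2, 2))

y <- DGEList(counts = counts, group = group)

# Drop genes with too few counts to be informative
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

# TMM normalization factors, stored in y$samples$norm.factors
y <- calcNormFactors(y)

# Dispersion estimation and quasi-likelihood fit
design <- model.matrix(~group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Test group 2 vs group 1 (second coefficient of the design)
qlf <- glmQLFTest(fit, coef = 2)
res <- topTags(qlf, n = Inf)
head(res$table)
```

Because the counts here are simulated without any group effect, few or no genes are expected to come out as significant; the point is only the order of the steps.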