Question: How to normalize raw counts before drawing a heatmap?
8 months ago by
smyiz20
smyiz20 wrote:

I have raw counts and edgeR differential expression results, and I want to draw a heatmap of the logFC values. I have 12 samples: two cell lines, each with triplicate total and IP samples.

``````
(cell1T1, cell1T2, cell1T3, cell1IP1, cell1IP2, cell1IP3,
cell2T1, cell2T2, cell2T3, cell2IP1, cell2IP2, cell2IP3)
``````

I want to normalize the count data by calculating a scaling factor, CPM, and the fold change (IP/total). My R script and its output:

``````
cs = colSums(count)
scale_factor <-  1e6 / colSums(count)
scale_factor
data = t( t(count)/cs) * 1e6
cs2 = colSums(data)
cs2

> cs = colSums(count)
cell1T1     cell1T2    cell1T3      cell1IP1     cell1IP2     cell1IP3
9061105     6832076    1472003      12019856     5921757      2835648
cell2T1     cell2T2    cell2T3      cell2IP1     cell2IP2     cell2IP3
4696948     4387729    3907566      7580533      14312254     19052159

> scale_factor <-  1e6 / colSums(count)
> scale_factor

cell1T1     cell1T2    cell1T3      cell1IP1     cell1IP2     cell1IP3
0.11036182  0.14636840 0.67934644   0.08319567   0.16886880   0.35265308
cell2T1     cell2T2    cell2T3      cell2IP1     cell2IP2     cell2IP3
0.21290421  0.22790833 0.25591378   0.13191685   0.06987020   0.05248749

> data = t( t(count)/cs) * 1e6
> cs2 = colSums(data)
> cs2

cell1T1     cell1T2    cell1T3      cell1IP1     cell1IP2     cell1IP3
1e+06       1e+06      1e+06        1e+06        1e+06        1e+06
cell2T1     cell2T2    cell2T3      cell2IP1     cell2IP2     cell2IP3
1e+06       1e+06      1e+06        1e+06        1e+06        1e+06
``````

All columns sum to 1e6 (1 million). Does this mean the values are CPM? After that, how can I find the fold changes between IP and total?

Hi,

You can apply z-score standardization to the edgeR-normalized counts.

In R, `scale()` standardizes columns, so to calculate gene-wise z-scores you transpose the matrix (putting genes in columns), apply `scale()`, and then transpose the result back:

``````
z_edgeRnormcounts = t(scale(t(edgeRnormcounts), center = TRUE, scale = TRUE))
``````

You can then use these z-scores to plot a heatmap for your genes of interest.
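As a minimal sketch of what the gene-wise z-score does, the example below uses a small made-up matrix (the name `expr` and its values are assumptions, standing in for your normalized expression values) and checks that every gene ends up with mean 0 and standard deviation 1:

```r
# Toy matrix standing in for normalized expression values
# (genes in rows, samples in columns); the values are invented.
set.seed(1)
expr <- matrix(rnorm(24, mean = 8, sd = 2), nrow = 4,
               dimnames = list(paste0("gene", 1:4),
                               c("T1", "T2", "T3", "IP1", "IP2", "IP3")))

# scale() works column-wise, so transpose to put genes in columns,
# standardize, then transpose back to genes-in-rows.
z <- t(scale(t(expr), center = TRUE, scale = TRUE))

# Per-gene (row) mean and standard deviation after standardization.
row_means <- rowMeans(z)
row_sds <- apply(z, 1, sd)
print(round(row_means, 10))
print(row_sds)
```

The resulting `z` can be passed to a plotting function such as base R's `heatmap()`; in that case set `scale = "none"`, since the rows are already standardized.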

8 months ago by
h.mon29k
Brazil
h.mon29k wrote:

Usually, the packages used to analyse differential expression separate exploratory analyses (such as clustering, PCA, and heatmaps) from the actual differential expression testing.

edgeR provides the `cpm( )` function, which computes counts-per-million from the raw counts and, with `log = TRUE`, produces moderated log2-counts-per-million. If you pass a DGEList object to `cpm( )`, it will use the normalized library sizes in the calculations; if you pass a matrix (and set `cpm( count, log = FALSE )`), then I think the result will be the same as yours above. You can probably use the `cpmByGroup( )` function to calculate fold-changes, but this is not the preferred method.
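To illustrate the plain CPM calculation, here is a small base-R sketch on invented counts (the matrix, the sample names, and the pseudocount of 1 are all assumptions for illustration). It also computes a naive IP/total log2 fold change from group-mean CPM, which is useful for a quick look but, as noted above, is not the preferred way to estimate fold changes:

```r
# Made-up raw counts: 5 genes x 6 samples (3 total, 3 IP).
set.seed(42)
count <- matrix(rpois(30, lambda = 50), nrow = 5,
                dimnames = list(paste0("gene", 1:5),
                                c("T1", "T2", "T3", "IP1", "IP2", "IP3")))

# Divide each column by its library size and scale to one million.
lib_size <- colSums(count)
cpm_manual <- sweep(count, 2, lib_size, "/") * 1e6
print(colSums(cpm_manual))  # each column sums to one million

# Same calculation written with the transpose idiom from the question.
cpm_question <- t(t(count) / lib_size) * 1e6
print(all.equal(cpm_manual, cpm_question))

# Naive log2 fold change (IP over total) from group-mean CPM; the
# pseudocount of 1 is an arbitrary choice to avoid log2(0).
is_ip <- grepl("^IP", colnames(count))
log2fc <- log2(rowMeans(cpm_manual[, is_ip]) + 1) -
  log2(rowMeans(cpm_manual[, !is_ip]) + 1)
print(log2fc)

# Optional cross-check against edgeR's cpm() on a plain matrix,
# run only if the package happens to be installed.
if (requireNamespace("edgeR", quietly = TRUE)) {
  print(all.equal(cpm_manual, edgeR::cpm(count, log = FALSE)))
}
```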

In edgeR, the differential expression testing, including fold-change estimation, is performed on untransformed counts. There are several methodologies for DE modeling and testing in edgeR (such as `glmQLFit( )` / `glmQLFTest( )`, `glmFit( )` / `glmLRT( )`, and others); one then extracts the fold-changes from the test results, for example with `topTags( )`.
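A rough sketch of the quasi-likelihood route is shown below. It assumes the Bioconductor package edgeR is installed, uses simulated counts, and covers only one cell line (three total and three IP samples); the object names and the simple `~ group` design are illustrative assumptions, and a real analysis with two cell lines would need a design that accounts for both factors:

```r
# Sketch only: needs the Bioconductor package edgeR; counts are simulated.
if (requireNamespace("edgeR", quietly = TRUE)) {
  library(edgeR)
  set.seed(1)
  samples <- c(paste0("T", 1:3), paste0("IP", 1:3))
  counts <- matrix(rnbinom(200 * 6, mu = 100, size = 10), nrow = 200,
                   dimnames = list(paste0("gene", 1:200), samples))
  # "total" is the reference level, so logFC is IP relative to total.
  group <- factor(rep(c("total", "IP"), each = 3),
                  levels = c("total", "IP"))

  y <- DGEList(counts = counts, group = group)  # raw, untransformed counts
  y <- calcNormFactors(y)                       # TMM normalization factors
  design <- model.matrix(~ group)               # coefficient 2: IP vs total
  y <- estimateDisp(y, design)                  # dispersion estimates
  fit <- glmQLFit(y, design)                    # quasi-likelihood GLM fit
  res <- glmQLFTest(fit, coef = 2)              # test IP vs total
  tab <- topTags(res, n = Inf)$table            # logFC column is log2(IP/total)
  print(head(tab))
}
```

The `logFC` column of `tab` is what you would typically carry into a heatmap of fold changes.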