Question

How to "guess" the transformation based on already-transformed, "normalized count data"?

0

12 months ago

wangziwei0010 ▴ 30

Thanks for your attention,

TLDR:

The minimum value in a transformed count matrix is -2.57. How can I guess what transformation was applied?
Is there any good advice on performing differential gene expression analysis on such transformed data?

Details:

I would like to analyze RNA-seq data, but the raw data are under controlled access, so I downloaded the processed data from the original paper.
The authors describe the processing as follows: "The R-packages, tximport and edgeR, were used to respectively summarize the expression at gene-level and normalize the data."
The maximum value is around 15, so I suspect the data were log-transformed.
The minimum value is -2.57, which appears 310861 times in the 20453x96 matrix (15.8% of all entries).

FYI:

Here is the paper: https://www.nature.com/articles/s41467-020-18640-0
The cpm() function in edgeR, when called with log=TRUE, returns log2 values and uses a default prior.count of 2.
A snapshot of the data:

edger rnaseq rna rna-seq • 971 views

ADD COMMENT • link 12 months ago by wangziwei0010 ▴ 30

0

Email the corresponding author of the paper.

ADD REPLY • link 12 months ago by jv ★ 1.8k

0

I have previously emailed the original author to request the raw data (which they cannot share due to EU regulations), but I would like to refrain from bothering them again unless absolutely necessary out of courtesy. Thank you for your attention and guidance.

ADD REPLY • link 12 months ago by wangziwei0010 ▴ 30

1

12 months ago

LChart 3.9k

Assuming a transform of the form (with x the raw count):

y = log(a + x/b)

then the smallest value corresponds to x = 0 and the second smallest to x = 1:

min(y) ~ log(a)

min(y[y>min(y)]) ~ log(a + 1/b)

Unfortunately I think you just have to make assumptions about the base of the log.
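
As a minimal sketch of that idea, the R snippet below simulates a transform of this form and then recovers a and b from the two smallest values. The values of `a_true` and `b_true`, the Poisson counts, and the choice of base 2 are all assumptions made for illustration; none of them come from the paper.

```r
# Simulate y = log2(a + x/b) with known, hypothetical a and b
set.seed(1)
a_true <- 0.5
b_true <- 4
x <- rpois(1000, lambda = 3)       # simulated raw counts, includes zeros and ones
y <- log2(a_true + x / b_true)

base <- 2                          # assumed log base; this cannot be inferred from y alone

y_min  <- min(y)                   # value produced by x = 0
y_next <- min(y[y > y_min])        # value produced by x = 1, if a count of 1 occurs

a_hat <- base^y_min                # invert min(y) ~ log(a)
b_hat <- 1 / (base^y_next - a_hat) # invert min(y[y > min(y)]) ~ log(a + 1/b)

c(a = a_hat, b = b_hat)
```

On real data the scaling factor would generally differ between samples (it depends on library size), so it is probably safer to apply this column by column. The recovery also only works if each column actually contains a raw count of 1.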

ADD COMMENT • link 12 months ago by LChart 3.9k

0

Gordon Smyth's answer, which reasons from edgeR's default log base and prior count, sounds reasonable. Thank you for your time and guidance!

ADD REPLY • link 12 months ago by wangziwei0010 ▴ 30

score 7 · Accepted Answer · 2023-04-05

It appears likely that the values are log2-CPM values produced by edgeR::cpm() with log=TRUE. The smallest value that would be returned by that function is equal to log2(2/L) where L is the average normalized library size in millions. It is entirely possible that the average normalized library size for this study would be around 11.9 million, so the smallest log2CPM value would be

> log2( 2 / 11.9 )
[1] -2.57

which is what you have. You would get this minimum value whenever the original count was zero.
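
The same relationship can be run in reverse to see what average library size the observed minimum implies. This is only a sketch: it assumes base 2 and the default prior.count of 2, and it takes the minimum of -2.57 from the question at its rounded value.

```r
prior_count <- 2       # assumed: default prior.count of edgeR::cpm()
min_value   <- -2.57   # minimum reported in the question (rounded)

# Solve min_value = log2(prior_count / L) for L
L <- prior_count / 2^min_value   # implied average library size, in millions
L

# Forward check: the minimum predicted from that L
log2(prior_count / L)
```

The implied L comes out close to 11.9 million. If a different prior count had been used, the implied library size would scale with it, so this check supports the cpm() interpretation but does not prove it.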

The authors say they produced normalized counts using edgeR. The only functions provided by edgeR for exporting normalized counts are cpm() and rpkm(). The values you show are compatible with cpm but not with rpkm so the conclusion would have to be that they are log2CPM values.

You can perform differential expression analyses of log2CPM values using limma-trend. That won't be exactly the same as performing a differential analysis using the original counts, but it is still very good. If the library sizes are reasonably consistent, as they probably are for this study, then limma-trend has essentially the same power and FDR performance as a quasi-likelihood analysis in edgeR.
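
A limma-trend analysis could look roughly like the sketch below. The matrix `logcpm` is simulated here as a stand-in for the downloaded log2CPM matrix, and the two-group design (`group`, with levels "control" and "treated") is hypothetical; the real design would come from the study's sample annotation.

```r
library(limma)

# Simulated stand-in for the downloaded matrix: genes in rows, samples in columns
set.seed(42)
n_genes   <- 1000
n_samples <- 12
logcpm <- matrix(rnorm(n_genes * n_samples, mean = 5, sd = 1),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_len(n_samples))))

# Hypothetical two-group design; replace with the real sample annotation
group  <- factor(rep(c("control", "treated"), each = n_samples / 2))
design <- model.matrix(~ group)

# Fit a linear model per gene, then apply empirical Bayes moderation.
# trend = TRUE lets the prior variance depend on average log-expression,
# which is what distinguishes limma-trend from a plain limma fit.
fit <- lmFit(logcpm, design)
fit <- eBayes(fit, trend = TRUE)

# Top genes for the treated-vs-control coefficient
top <- topTable(fit, coef = 2, number = 10)
top
```

With real data it is usually worth filtering out genes that sit at the minimum value in nearly all samples before fitting, since they carry little information.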