Question

Is linearity maintained in linear regression of RNA-Seq?

6.5 years ago

CY ▴ 750

Linear regression is used in cell type deconvolution (e.g. TIMER or CIBERSORT). I am wondering whether the linearity assumption is actually met during modeling.

I imagine raw counts are not suitable, because differing library sizes cause non-linearity. Log-transformed values probably cause non-linearity as well, right? So is there any kind of normalization that fits the linearity assumption?
Besides, unlike microarray data, an RNA-Seq library is a zero-sum game, which is said to be non-linear (although I don't know why this would cause non-linearity. It may introduce some sort of dependency, but the overall expression is still the weighted sum of the expression of its components, right?)
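To make the concern about log transformation concrete, here is a minimal sketch with made-up numbers. The two "pure cell type" profiles and the mixing fractions are hypothetical, and the values are assumed to already be on a comparable linear scale (something like CPM). It only illustrates that a weighted sum holds on the linear scale and is not preserved by a log transform; it says nothing about how TIMER or CIBERSORT work internally.

```python
import numpy as np

# Hypothetical linear-scale expression (e.g. CPM) of four genes in two pure cell types.
profile_a = np.array([100.0, 10.0, 500.0, 1.0])
profile_b = np.array([5.0, 200.0, 500.0, 50.0])

# Assumed mixing fractions of the two cell types (sum to 1).
frac_a, frac_b = 0.3, 0.7

# On the linear scale the bulk profile is the weighted sum of its components.
mixture = frac_a * profile_a + frac_b * profile_b

# Log of the mixture versus the weighted sum of the logged profiles.
log_of_mixture = np.log2(mixture + 1)
mixture_of_logs = frac_a * np.log2(profile_a + 1) + frac_b * np.log2(profile_b + 1)

# Largest per-gene disagreement between the two, in log2 units.
max_gap = np.max(np.abs(log_of_mixture - mixture_of_logs))

print("log2 of mixture:       ", np.round(log_of_mixture, 3))
print("mixture of log2 values:", np.round(mixture_of_logs, 3))
print("largest gap:", round(float(max_gap), 3))
```

The two vectors agree only for the gene whose expression is identical in both cell types; for every other gene the log of the mixture is larger, which is what one would expect from the concavity of the logarithm.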

RNA-Seq • 3.0k views

updated 23 months ago by Kevin Blighe 89k • written 6.5 years ago by CY ▴ 750

RNA-seq data (raw counts) can be transformed for linear modeling. Try the voom method (limma) on your RNA-seq data.

6.5 years ago by cpad0112 21k

Even if logCPM (voom) transformed expression values maintain linearity, we still face the zero-sum game issue, which causes dependence between variables and non-linearity, right? This issue is inherent in the raw data, and I can't see any way to fix it.

6.5 years ago by CY ▴ 750

score 4 · Answer 1 · 2018-12-24

RNA-seq count data are not normally distributed; they more closely follow a negative binomial (overdispersed, Poisson-like) distribution. Running ordinary linear regression on RNA-seq counts, normalised or otherwise, is therefore not a great idea. DESeq2, for example, fits a negative binomial generalised linear model to the counts and, by default, derives its p-values via the Wald test applied to the model terms.

If you want to use RNA-seq data for cell deconvolution, I would obtain normalised, transformed counts, such as logCPM (edgeR), variance-stabilised (DESeq2), or regularised log (DESeq2) expression levels. In edgeR, you can adjust the prior count that is added to the counts before the log transformation, which keeps 0-count genes finite. DESeq2's transformations deal with these low-count genes in their own way.
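As a rough illustration of what the prior count does, here is a simplified logCPM sketch in Python. The function `log_cpm` and the toy matrix are hypothetical and use a flat prior added to every count; this is a stand-in for the idea, not edgeR's exact computation.

```python
import numpy as np

def log_cpm(counts, prior_count=2.0):
    """Simplified log2 counts-per-million for a genes x samples count matrix.

    Assumption: a flat prior_count is added to every count before scaling by
    the library size. This is an illustration, not edgeR's implementation.
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)  # total counts per sample (column)
    return np.log2((counts + prior_count) / lib_sizes * 1e6)

# Toy matrix: 3 genes x 2 samples, with a 10-fold difference in library size.
toy_counts = np.array([[0, 0],
                       [10, 100],
                       [90, 900]])

log_cpm_default = log_cpm(toy_counts)                 # prior_count = 2
log_cpm_small = log_cpm(toy_counts, prior_count=0.5)  # smaller prior

print(np.round(log_cpm_default, 2))
print(np.round(log_cpm_small, 2))
```

In this sketch, a smaller prior pushes the 0-count gene further down on the log scale, while well-expressed genes barely move, so the choice of prior mostly affects how low-count genes behave downstream.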

Personally, I would then compute Z-scores from the transformed data and use those for deconvolution, as they are more readily interpreted. For example, you could regard genes with Z > 3 as being highly expressed or representative of a tissue or cell type.
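A minimal sketch of the Z-score step, assuming each gene is scaled across samples (rows are genes, columns are samples). The helper `gene_z_scores` and the toy values are hypothetical, and the input is assumed to be already transformed (logCPM, VST, or rlog).

```python
import numpy as np

def gene_z_scores(expr):
    """Z-score each gene (row) across samples (columns).

    Assumption: expr holds already-transformed expression values.
    Genes with zero variance get NaN instead of a division-by-zero result.
    """
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, ddof=1, keepdims=True)  # sample standard deviation
    sd[sd == 0] = np.nan
    return (expr - mean) / sd

# Toy data: 2 genes x 20 samples.
# Gene 0 sits at 5.0 in 19 samples and at 12.0 in the last one; gene 1 is constant.
toy_expr = np.vstack([
    np.append(np.full(19, 5.0), 12.0),
    np.full(20, 7.0),
])

z = gene_z_scores(toy_expr)

# Boolean mask of gene/sample pairs passing the Z > 3 cut-off (NaN compares as False).
high = z > 3
print(np.argwhere(high))
```

With only a handful of samples no value can reach Z > 3 under this scaling, so a cut-off like this is only meaningful in a reasonably large cohort.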