Question: Is linearity maintained in linear regression of RNA-Seq?
CY (370), United States, wrote 8 months ago:

Linear regression is used in cell type decomposition (TIMER or CIBERSORT). I am wondering whether the linearity assumption is met during modeling.

  1. I imagine raw counts are not suitable, because differing library sizes cause non-linearity. Log-transformed values probably cause non-linearity as well, right? So is there any kind of normalization that fits the linearity assumption?

  2. Besides, unlike microarray data, an RNA-Seq library is a 0-sum game, which is said to be non-linear (although I don't see why this causes non-linearity. It may cause some sort of dependency, but the overall expression is still the weighted sum of the expression of its components, right?)
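
Point 1 can be made concrete with a small sketch. It assumes two made-up cell-type profiles on a linear scale (e.g. CPM) and an assumed mixing fraction; none of these numbers come from TIMER or CIBERSORT. It only illustrates that a mixture which is a weighted sum on the linear scale is no longer a weighted sum after a log transform.

```python
import math

# Hypothetical per-gene expression (linear scale, e.g. CPM) for two pure cell types.
cell_type_a = [100.0, 10.0, 500.0]
cell_type_b = [20.0, 200.0, 500.0]
fraction_a = 0.3  # assumed share of type A in the mixture
fraction_b = 1.0 - fraction_a


def mix(profile_a, profile_b, w_a, w_b):
    """Weighted sum of two profiles, gene by gene."""
    return [w_a * a + w_b * b for a, b in zip(profile_a, profile_b)]


# Linear scale: the mixture is, by construction, the weighted sum of its components.
mixture_linear = mix(cell_type_a, cell_type_b, fraction_a, fraction_b)

# Log scale: the log of the true mixture ...
log_of_mixture = [math.log2(x) for x in mixture_linear]
# ... is not the same as mixing the log-transformed profiles.
mixture_of_logs = mix(
    [math.log2(x) for x in cell_type_a],
    [math.log2(x) for x in cell_type_b],
    fraction_a,
    fraction_b,
)

for gene, (lm, ml) in enumerate(zip(log_of_mixture, mixture_of_logs)):
    print(f"gene {gene}: log2(mixture) = {lm:.3f}, mixture of log2 = {ml:.3f}")
```

The two columns agree only for the gene that is expressed equally in both cell types; for the others the log of the mixture is larger than the mixture of the logs, because the logarithm is concave.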


RNA-seq data (raw counts) can be transformed for linear modeling. Try the voom method (limma) on the RNA-seq data.

written 8 months ago by cpad0112 (11k)

Even though logCPM (voom) transformed expression values maintain linearity, we still face the 0-sum game issue, which causes dependence between variables and non-linearity, right? This issue is inherent in the raw data and I can't see any way to fix it.

written 8 months ago by CY (370)
Kevin Blighe (47k) wrote 8 months ago:

RNA-seq count data does not follow a normal distribution; it more closely resembles a negative binomial (Poisson-like) distribution. Running ordinary linear regression on RNA-seq counts, normalised or otherwise, is therefore not a great idea. DESeq2, for example, fits a negative binomial generalised linear model to the counts and, by default, derives its p-values via the Wald test applied to the model terms.

If you are looking to use RNA-seq data for cell deconvolution, I would obtain normalised, transformed counts, such as logCPM (edgeR), variance-stabilised (DESeq2), or regularised log (DESeq2) counts. In edgeR, you can experiment with the prior count that is added to the counts before the log transformation, which avoids taking the log of zero for 0-count genes. DESeq2's transformations deal with these low-count genes in their own way.
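
To show what the prior count does, here is a minimal sketch of a simplified logCPM in plain Python. It is not edgeR's implementation: the function `log_cpm`, the way the prior is applied (added to every count, with twice the prior added to the library size), and the example counts are all assumptions made for illustration.

```python
import math


def log_cpm(counts, prior_count=2.0):
    """Simplified log2 counts-per-million for one sample.

    Assumption: the prior is added to every count and twice the prior to the
    library size. This is an illustrative formula, not edgeR's own code.
    """
    library_size = sum(counts)
    return [
        math.log2((c + prior_count) / (library_size + 2.0 * prior_count) * 1e6)
        for c in counts
    ]


sample = [0, 5, 50, 945]  # hypothetical raw counts for four genes

for prior in (0.5, 2.0, 5.0):
    values = log_cpm(sample, prior_count=prior)
    print(f"prior = {prior}: {[round(v, 2) for v in values]}")
```

A larger prior pulls the 0-count and low-count genes upwards and compresses the differences between them, while highly expressed genes barely move.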

Personally, I would then obtain Z-scores from the transformed data and use those for deconvolution, as they are more readily interpreted. For example, you could regard genes with Z > 3 as being highly expressed or representative of a tissue or cell type.
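
The Z-score step could look roughly like the sketch below. It assumes the scores are computed per gene across samples on already transformed values; the gene names, the values, and the helper `gene_z_scores` are hypothetical.

```python
import statistics


def gene_z_scores(values):
    """Z-score one gene's transformed expression across samples."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return [0.0 for _ in values]  # flat gene: no variation to score
    return [(v - mean) / sd for v in values]


# Hypothetical transformed expression (e.g. logCPM), one list of 12 samples per gene.
expression = {
    "GENE_A": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.0, 10.0],
    "GENE_B": [7.0, 7.1, 6.9, 7.0, 7.2, 6.8, 7.0, 7.1, 6.9, 7.0, 7.1, 6.9],
}

threshold = 3.0
z = {gene: gene_z_scores(vals) for gene, vals in expression.items()}

# Sample indices in which each gene exceeds the threshold.
high = {
    gene: [i for i, score in enumerate(scores) if score > threshold]
    for gene, scores in z.items()
}
print(high)
```

One caveat when picking the cut-off: with the sample standard deviation and n samples, no Z-score can exceed (n - 1) / sqrt(n) in absolute value, so Z > 3 is only reachable with at least 11 samples.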


Thanks Kevin. If I understand you correctly, the methods you suggest normalize for library size and make the counts more normally distributed.

However, the 0-sum game nature of an RNA-Seq library still causes dependency between variables and a non-linear relationship. Would it compromise a linear model a lot?
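
The dependency described here can be seen in a toy example. The sketch assumes three genes with made-up absolute abundances and a helper `cpm` that rescales them to a fixed total; it illustrates the effect only and says nothing about how much it matters for a given deconvolution method.

```python
def cpm(abundances):
    """Rescale abundances so that they sum to one million."""
    total = sum(abundances)
    return [a / total * 1e6 for a in abundances]


# Hypothetical absolute transcript abundances for three genes.
baseline = [1000.0, 1000.0, 8000.0]
# Only the third gene changes in absolute terms.
perturbed = [1000.0, 1000.0, 18000.0]

cpm_baseline = cpm(baseline)
cpm_perturbed = cpm(perturbed)

for gene, (before, after) in enumerate(zip(cpm_baseline, cpm_perturbed)):
    print(f"gene {gene}: CPM before = {before:.0f}, CPM after = {after:.0f}")
```

The first two genes are unchanged in absolute terms, yet their CPM values halve because the third gene takes a larger share of the fixed total.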

written 8 months ago by CY (370)

To go into more depth on the statistics of this, try Cross Validated (Stack Exchange) or the Bioconductor support forum.

written 8 months ago by Kevin Blighe (47k)