Question

RNA Seq Time Regression

9.6 years ago

tucanj ▴ 100

I have tumor age (an ordinal variable, divided into age brackets), gene expression (in RPKM), and sex (binary) for many tumors from people of different ages (not paired, i.e. not a time series of the same tumors but different tumors at each age). I want to find the genes that are most differentially expressed with age, controlling for sex. What would be the best way to do this?

More specifically:

1. Do I need to transform the RPKM values to ranks (as per this thread)? The difference is that they use regularized regression.
2. Is linear regression appropriate? This paper seems to suggest that quantile regression is better than the linear regression used in many of the microarray-based aging gene expression studies (however, their algorithm has other features, and their age is not ordinal).
3. Because I am repurposing data and do not know the batches, would it be appropriate to run sva (or something similar) on it?

Thanks!

RNA-Seq R • 3.8k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by tucanj ▴ 100

Ram · Accepted Answer · 2014-09-30

9.6 years ago

Devon Ryan 104k

If possible, get the raw data; RPKMs have a LOT of issues.

1. Depending on how complex your design becomes, it can become difficult to analyze rank-transformed data.
2. Give linear regression a try first; there are a lot more tools for it. I'd only deal with quantile regression if the results were unsatisfactory.
3. Absolutely. You might also try just including the batches as a factor, though odds are good that SVA will work a bit better.
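
As a rough starting point for the linear-regression route, here is a minimal sketch in base R. It assumes a genes-by-samples RPKM matrix and regresses log2(RPKM + 1) on age bracket (coded 1, 2, 3) plus sex, one gene at a time, followed by Benjamini-Hochberg correction. The data are simulated, all object names (`rpkm`, `age`, `sex`, `fit_gene`) are made up for illustration, and the log transform with a pseudocount of 1 is an assumption rather than something prescribed above.

```r
set.seed(42)

# Simulated stand-in data: 100 genes x 30 tumors (replace with real RPKMs)
n_genes   <- 100
n_samples <- 30
age <- rep(1:3, each = 10)                    # age bracket coded as 1, 2, 3
sex <- factor(rep(c("F", "M"), times = 15))   # binary covariate
rpkm <- matrix(rexp(n_genes * n_samples, rate = 0.1),
               nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Make gene1 increase with age so there is something to find
rpkm[1, ] <- rpkm[1, ] * 4^age

# Assumed transform: log2 with a pseudocount of 1
log_expr <- log2(rpkm + 1)

# Fit expression ~ age + sex for one gene; return the age slope and its p-value
fit_gene <- function(y) {
  coefs <- summary(lm(y ~ age + sex))$coefficients
  c(slope = coefs["age", "Estimate"], p = coefs["age", "Pr(>|t|)"])
}

res <- as.data.frame(t(apply(log_expr, 1, fit_gene)))
res$padj <- p.adjust(res$p, method = "BH")    # FDR control across genes
head(res[order(res$padj), ])
```

The `slope` column is the estimated change in log2 expression per age bracket with sex held fixed, so sorting by `padj` gives a first-pass ranking of age-associated genes.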

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Devon Ryan 104k

If I could get read counts instead of RPKM, would that affect my analysis?

ADD REPLY • link 9.6 years ago by tucanj ▴ 100

Quite likely, yes. There are a number of issues with using RPKMs, and raw counts tend to make life much simpler.

ADD REPLY • link 9.6 years ago by Devon Ryan 104k

Can you please elaborate? Would I do the linear regression on the counts instead of RPKM? Would I need to include other factors in the regression such as gene length?

ADD REPLY • link 9.6 years ago by tucanj ▴ 100

If you can get the counts then you'll use a GLM rather than straight linear regression (just use the DESeq2, edgeR or limma/voom Bioconductor packages). There's no reason to include gene length in the design (at least unless samples are significantly biased differently by it, but that's pretty unusual).
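
For the count-based route, a minimal DESeq2 sketch might look like the following. The counts are simulated, the object names (`counts`, `coldata`) are placeholders, and coding the age bracket as a numeric 1/2/3 covariate is an assumption about how the brackets are represented. edgeR or limma/voom would use the same design-formula idea with their own fitting functions.

```r
library(DESeq2)

set.seed(1)

# Simulated stand-in data: 200 genes x 12 tumors (replace with real raw counts)
n_genes   <- 200
n_samples <- 12
gene_mu   <- 2^runif(n_genes, min = 4, max = 10)   # per-gene mean expression
counts <- matrix(rnbinom(n_genes * n_samples,
                         mu = rep(gene_mu, times = n_samples), size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("tumor", seq_len(n_samples))))

# Sample table: age bracket as a number, sex as a factor
coldata <- data.frame(age = rep(1:3, times = 4),
                      sex = factor(rep(c("F", "M"), each = 6)),
                      row.names = colnames(counts))

# Negative binomial GLM per gene, with sex as an adjustment covariate
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ sex + age)
dds <- DESeq(dds)

# log2 fold change per one-bracket increase in age, adjusted for sex
res <- results(dds, name = "age")
head(res[order(res$padj), ])
```

DESeq2 prints a message when a design contains an integer-valued numeric variable, asking whether a factor was intended. For a trend analysis the numeric coding is deliberate.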

ADD REPLY • link 9.6 years ago by Devon Ryan 104k

From my research, I cannot find a way to use an ordinal variable in one of these packages and find the trend of a gene's expression without doing a comparison between two groups. The closest I could get would be to compare Age 3 vs Age 2 and Age 2 vs Age 1 and then take the union of the differentially expressed genes. Is there a way to find the linear trend (i.e. Gene A increases with age)? Or can I just fit a negative binomial GLM in R and adjust all the p-values for FDR?

ADD REPLY • link 9.6 years ago by tucanj ▴ 100

You just use it as a covariate, so something like

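age <- rep(1:3, 3)           # age as a numeric covariate (here 1, 2, 3 for 9 samples)
design <- ~ age + condition  # 'condition' is a placeholder for other covariates, e.g. sex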

The coefficient on age is then the change (in log expression) per unit of age (so per year, month, etc.). You might be able to use an ordered factor too, but I've never tried it and don't know how model.matrix() treats it.
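
To see how the two encodings differ, one can inspect the design matrices directly in base R. This is only an illustrative sketch with made-up sample annotation (`age_num`, `age_ord`, `sex` are hypothetical names). A numeric age yields a single slope column, whereas under R's default contrast settings an ordered factor is expanded into polynomial contrasts (`contr.poly`), giving a linear (`.L`) and a quadratic (`.Q`) column for three levels.

```r
# Hypothetical sample annotation: 3 age brackets x 3 samples, plus sex
age_num <- rep(1:3, 3)
age_ord <- factor(age_num, levels = 1:3, ordered = TRUE)
sex     <- factor(c("F", "M", "F", "M", "F", "M", "M", "F", "M"))

# Numeric covariate: one column, interpreted as change per bracket
mm_num <- model.matrix(~ age_num + sex)
colnames(mm_num)

# Ordered factor: polynomial contrasts (linear and quadratic trend columns)
mm_ord <- model.matrix(~ age_ord + sex)
colnames(mm_ord)
```

With the ordered factor, the `.L` coefficient captures the linear trend across brackets, which is the closest analogue of the single numeric slope. The `.Q` coefficient picks up curvature that the numeric coding cannot represent.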

ADD REPLY • link 9.6 years ago by Devon Ryan 104k

Furthermore, would it be better to correct for sex using ComBat or as a factor in the linear regression?