RNA Seq Time Regression
9.6 years ago
tucanj ▴ 100

I have tumor age (an ordinal variable, divided into age brackets), gene expression (in RPKM), and sex (binary) for many tumors from people of different ages (unpaired, i.e., not a time series of the same tumors but different tumors at each age). I want to find the genes that are most differentially expressed with age, controlling for sex. What would be the best way to do this?

More specifically:

  1. Do I need to transform the RPKM values to rank (as per this thread)? The difference is that they use regularized regression.
  2. Is a linear regression appropriate? Linear regression is used in many of the microarray studies of aging gene expression, but this paper seems to suggest that quantile regression is better (however, their algorithm has other features, and their age is not ordinal).
  3. Because I am repurposing data and do not know the batches, would it be appropriate to run sva (or something similar) on it?

Thanks!

RNA-Seq R • 3.8k views
9.6 years ago

If possible, get the raw data; RPKMs have a LOT of issues.

  1. Depending on how complex your design becomes, it can become difficult to analyze rank-transformed data.
  2. Give linear regression a try first; there are a lot more tools for it. I'd only deal with quantile regression if the results were unsatisfactory.
  3. Absolutely. You might also try just including the batches as a factor, though odds are good that SVA will work a bit better.
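
As a rough illustration of point 2, a per-gene linear regression of expression on age with sex as a factor could look like the sketch below. Everything here is an assumption for demonstration: the data are simulated, the names (`rpkm`, `age`, `sex`, `fit_gene`) are hypothetical, and the `log2(RPKM + 1)` transform is just one common choice, not something prescribed in this thread.

```r
# Simulated stand-in data (hypothetical): 50 genes x 30 unpaired tumors.
set.seed(1)
n_genes <- 50
n_samples <- 30
age <- rep(1:3, each = 10)                    # age bracket coded 1, 2, 3
sex <- factor(rep(c("F", "M"), times = 15))   # binary covariate
rpkm <- matrix(rexp(n_genes * n_samples, rate = 0.1),
               nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Give gene1 an upward trend with age so there is something to detect.
rpkm[1, ] <- rpkm[1, ] * 2^(3 * age)

# Assumed transform: log2(RPKM + 1) to tame the right skew.
log_expr <- log2(rpkm + 1)

# Fit expression ~ age + sex for one gene; return the age slope and its p-value.
fit_gene <- function(y) {
  fit <- lm(y ~ age + sex)
  coef(summary(fit))["age", c("Estimate", "Pr(>|t|)")]
}

res <- as.data.frame(t(apply(log_expr, 1, fit_gene)))
names(res) <- c("slope", "p_value")

# Benjamini-Hochberg adjustment across all genes.
res$fdr <- p.adjust(res$p_value, method = "BH")
head(res[order(res$p_value), ])
```

Sorting by `p_value` (or filtering on `fdr`) gives a ranked list of genes whose expression trends with age after accounting for sex.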

If I could get read counts instead of RPKM, would that affect my analysis?


Quite likely, yes. There are a number of issues with using RPKMs, and raw counts tend to make life much simpler.


Can you please elaborate? Would I do the linear regression on the counts instead of RPKM? Would I need to include other factors in the regression such as gene length?


If you can get the counts then you'll use a GLM rather than straight linear regression (just use the DESeq2, edgeR or limma/voom Bioconductor packages). There's no reason to include gene length in the design (at least unless samples are significantly biased differently by it, but that's pretty unusual).
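
To make the GLM idea concrete, here is a minimal hand-rolled sketch of a per-gene negative binomial regression on counts, using `glm.nb()` from the MASS package with library size as an offset. This is not what DESeq2, edgeR, or limma/voom do internally (those packages share information across genes, which this sketch does not), and the simulated data and all names (`counts`, `lib_size`, `fit_gene_nb`) are hypothetical.

```r
library(MASS)  # provides glm.nb(); ships with standard R distributions

# Simulated stand-in data (hypothetical): 20 genes x 30 unpaired tumors.
set.seed(2)
n_genes <- 20
n_samples <- 30
age <- rep(1:3, each = 10)                    # age bracket coded 1, 2, 3
sex <- factor(rep(c("F", "M"), times = 15))
lib_size <- round(runif(n_samples, 0.5e6, 1.5e6))  # assumed sequencing depths

# Expected counts = per-gene rate x library size; gene1 doubles per age bracket.
base_rate <- runif(n_genes, 1e-5, 1e-4)
mu <- outer(base_rate, lib_size)
mu[1, ] <- mu[1, ] * 2^age
counts <- matrix(rnbinom(length(mu), mu = mu, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Fit counts ~ age + sex with a log(library size) offset for one gene.
# Returns the age coefficient (natural-log scale) and its Wald p-value,
# or NA for both if the fit fails with an error.
fit_gene_nb <- function(y) {
  tryCatch({
    fit <- glm.nb(y ~ age + sex + offset(log(lib_size)))
    coef(summary(fit))["age", c("Estimate", "Pr(>|z|)")]
  }, error = function(e) c(NA_real_, NA_real_))
}

res_nb <- as.data.frame(t(apply(counts, 1, fit_gene_nb)))
names(res_nb) <- c("log_fc_per_bracket", "p_value")
res_nb$fdr <- p.adjust(res_nb$p_value, method = "BH")
head(res_nb[order(res_nb$p_value), ])
```

The offset plays the role of depth normalization, which is why gene length does not need to appear in the model: it is constant within a gene across samples.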


From my research, I cannot find a way to use an ordinal variable in one of these packages and find the trend of a gene's expression without doing a comparison between two groups. The closest I think I could do would be Age 3 vs Age 2 and Age 2 vs Age 1 and then take the union of the differentially expressed genes. Is there a way to find the linear trend (i.e., Gene A increases with age)? Or can I just fit a negative binomial GLM in R and adjust all the p-values for FDR?


You just use it as a covariate, so something like

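age <- rep(1:3, times = 3)   # numeric age code, one value per sample
design <- ~ age + condition  # 'condition' stands for the other covariate, e.g. sex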

The coefficient on age is then the change per unit (so, year, month, etc.). You might be able to use an ordered factor too; I've never tried it and don't know how model.matrix() treats it.
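
On the model.matrix() question: under R's default contrast settings, an ordered factor is encoded with orthogonal polynomial contrasts (`contr.poly`), so a three-level age factor yields a linear (`.L`) and a quadratic (`.Q`) column rather than a single slope. The small sketch below compares the two encodings; the variable names and the `sex` vector are made up for illustration.

```r
age_bracket <- rep(1:3, times = 3)
sex <- factor(c("F", "M", "M", "F", "F", "M", "M", "F", "M"))

# Numeric covariate: one column, so one "change per bracket" coefficient.
X_num <- model.matrix(~ age_bracket + sex)
colnames(X_num)
# "(Intercept)" "age_bracket" "sexM"

# Ordered factor: polynomial contrasts, so a linear and a quadratic term.
age_ord <- factor(age_bracket, levels = 1:3, ordered = TRUE)
X_ord <- model.matrix(~ age_ord + sex)
colnames(X_ord)
# "(Intercept)" "age_ord.L" "age_ord.Q" "sexM"
```

With the ordered factor, testing the `.L` coefficient asks about a linear trend across brackets, while `.Q` captures curvature. The numeric coding is simpler if a single per-bracket slope is all that is wanted.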


Furthermore, would it be better to correct for sex using ComBat or as a factor in the linear regression?


Use a factor for sex rather than ComBat.
