Question: RNA Seq Time Regression
4.6 years ago by
tucanj
Canada
tucanj wrote:

I have tumor age (nominal variable, divided into age brackets), gene expression (in RPKM) and sex (binary) for many tumors from people of different ages. The samples are not paired, i.e., this is not a time series of the same tumors but different tumors at each age. I want to find the genes that are most differentially expressed with age, controlling for sex. What would be the best way to do this?

More specifically:

1) Do I need to transform the RPKM values to ranks (as per this thread: "How to normalize RPKM values to use in regression models?")? The difference is that they use regularized regression.

2) Is a linear regression appropriate? This paper (http://www.biomedcentral.com/1471-2164/10/S3/S16) seems to suggest that quantile regression is better than linear regression, which is used in many of the microarray-based aging gene expression studies. However, their algorithm has other features, and their age is not ordinal.

3) Because I am repurposing data and do not know the batches, would it be appropriate to run sva (or something similar) on it?

Thanks!

rna-seq R • 2.3k views
4.6 years ago by
Devon Ryan
Freiburg, Germany
Devon Ryan wrote:

If possible, get the raw data; RPKMs have a LOT of issues.

  1. Depending on how complex your design becomes, it can become difficult to analyze rank-transformed data.
  2. Give linear regression a try first; there are a lot more tools for it. I'd only deal with quantile regression if the results were unsatisfactory.
  3. Absolutely. You might also try just including the batches as a factor, though odds are good that SVA will work a bit better.
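
As a rough illustration of point 2, a per-gene linear regression on RPKM values could look like the sketch below. The simulated data, the gene names, the log2(RPKM + 1) transform and the Benjamini-Hochberg adjustment are all assumptions made for the example; none of them is specified in this thread.

```r
# Simulated stand-in data: 3 genes x 12 tumors (all values are made up for illustration).
set.seed(1)
age <- rep(1:3, each = 4)                          # age bracket coded 1, 2, 3
sex <- factor(rep(c("F", "M"), times = 6))         # sex as a two-level factor
rpkm <- rbind(
  geneA = 2^(1 + 1.5 * age + rnorm(12, sd = 0.2)), # constructed to rise with age
  geneB = 2^(4 + rnorm(12, sd = 0.2)),             # no age trend
  geneC = 2^(3 + rnorm(12, sd = 0.2))              # no age trend
)

# Fit log2(RPKM + 1) ~ age + sex for each gene and keep the p-value on the age slope.
age_p <- apply(log2(rpkm + 1), 1, function(y) {
  fit <- lm(y ~ age + sex)
  summary(fit)$coefficients["age", "Pr(>|t|)"]
})

# Benjamini-Hochberg adjustment across genes.
age_fdr <- p.adjust(age_p, method = "BH")
print(age_fdr)
```

Genes can then be ranked by `age_fdr`; with real data the matrix `rpkm` would hold one row per gene and one column per tumor.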
written 4.6 years ago by Devon Ryan

If I could get read counts instead of RPKM, would that affect my analysis?

written 4.6 years ago by tucanj

Quite likely, yes. There are a number of issues with using RPKMs, and raw counts tend to make life much simpler.

written 4.6 years ago by Devon Ryan

Can you please elaborate? Would I do the linear regression on the counts instead of RPKM? Would I need to include other factors in the regression such as gene length?

written 4.6 years ago by tucanj

If you can get the counts then you'll use a GLM rather than straight linear regression (just use the DESeq2, edgeR or limma/voom Bioconductor packages). There's no reason to include gene length in the design, unless samples are biased by it to significantly different degrees, but that's pretty unusual.
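
To show only what such a model looks like for a single gene, here is a minimal sketch using `glm.nb` from the MASS package. This is a swapped-in stand-in, not what DESeq2, edgeR or limma/voom do internally (those packages also share dispersion information across genes). The simulated counts, the library sizes and the true slope of 0.5 are assumptions made for the example.

```r
library(MASS)  # provides glm.nb; ships with standard R installations

# Simulated stand-in data for ONE gene across 30 tumors (all values are made up).
set.seed(42)
n <- 30
age <- rep(1:3, each = 10)                   # age bracket coded 1, 2, 3
sex <- factor(rep(c("F", "M"), times = 15))  # sex as a two-level factor
lib_size <- round(runif(n, 1e6, 2e6))        # hypothetical sequencing depth per sample

# Expected count grows by a factor of exp(0.5) per age bracket.
mu <- lib_size * 1e-5 * exp(0.5 * age)
counts <- rnbinom(n, mu = mu, size = 10)

# Negative binomial GLM with log link; the offset accounts for sequencing depth.
fit <- glm.nb(counts ~ age + sex + offset(log(lib_size)))

# The age row holds the log fold change per age bracket and its p-value.
print(coef(summary(fit))["age", ])
```

Repeating this per gene and adjusting the age p-values with `p.adjust(..., method = "BH")` would give an FDR-controlled list, though the dedicated packages are usually the more robust choice with few samples.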

written 4.6 years ago by Devon Ryan

From my research, I cannot find a way to use an ordinal variable in one of these packages and find the trend of a gene's expression without doing a comparison between two groups. The closest I think I could do would be Age 3 vs Age 2 and Age 2 vs Age 1 and then take the union of the differentially expressed genes. Is there a way to find the linear trend (i.e., Gene A increases with age)? Or can I just fit a negative binomial GLM in R and adjust all the p-values for FDR?

written 4.6 years ago by tucanj

You just use it as a covariate, so something like:

age <- rep(1:3, 3)           # numeric age for each of 9 samples
design <- ~ age + condition  # age enters the design as a numeric covariate

The coefficient on age is then the change per unit of age (so, year, month, etc.). You might be able to use an ordered factor too; I've never tried it and don't know how model.matrix() treats it.
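
A quick way to see the difference is to build both design matrices in base R. The sample sheet below is hypothetical, and `sex` stands in for the second covariate. With R's default contrast settings, an ordered factor gets polynomial contrasts (`contr.poly`), so a three-level age variable produces a linear (`.L`) and a quadratic (`.Q`) column instead of a single slope.

```r
# Hypothetical sample sheet: 9 samples, three age brackets, sex as the second covariate.
samples <- data.frame(
  age = rep(1:3, 3),
  sex = factor(rep(c("F", "M", "F"), each = 3))
)

# Numeric age: a single column, so its coefficient is one slope per unit of age.
mm_numeric <- model.matrix(~ age + sex, data = samples)
print(colnames(mm_numeric))

# Ordered factor: base R applies polynomial contrasts (contr.poly) by default,
# giving a linear (.L) and a quadratic (.Q) column for three levels.
samples$age_ord <- factor(samples$age, levels = 1:3, ordered = TRUE)
mm_ordered <- model.matrix(~ age_ord + sex, data = samples)
print(colnames(mm_ordered))
```

If only a monotone trend with age is of interest, the numeric coding (or the `.L` term of the ordered factor) is the coefficient to test.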

written 4.6 years ago by Devon Ryan

Furthermore, would it be better to correct for sex using ComBat or as a factor in the linear regression?

written 4.6 years ago by tucanj

Use a factor for sex rather than ComBat.

written 4.6 years ago by Devon Ryan