JunctionSeq: are co-variate terms used in the calculation of effect size using linear contrasts?
7.3 years ago

I am using JunctionSeq to test for differential exon and splice junction usage in a paired-end RNA-seq experiment and have a question about how the package computes effect sizes using linear contrasts.

Experimental design: my main condition variable is a drug treatment with three levels: vehicle, acute drug, chronic drug. However, to acquire the RNA I use an affinity purification that generates two biochemical fractions: "input" and "bound" RNA. Input is RNA from the whole tissue before affinity purification; bound is RNA enriched during affinity purification. As a note, I have enough biological replicates to calculate the dispersion.

My question: I am interested in the drug effect, but specifically as it co-varies with the biochemical fraction, since there is variation in the dissection and biochemistry that is captured by the fraction variable. Thus, in my final model, I have drug treatment (condition) and fraction (co-variate). I have incorporated the co-variate into my GLM test formulas exactly as described in Section 8.2 of the user manual (pp. 40-41, Advanced Generalized Linear Modelling), substituting my fraction variable for the Smoker variable. Section 7.5 of the user manual (p. 37, Parameter Estimation) states that linear contrasts of the beta parameter estimates are used to calculate the effect size (fold-change), and that co-variates can be incorporated during this process. So my main question is the following: by default, if I run a JunctionSeq analysis with the appropriate test.formulae, are all parameter estimates used in the calculation of fold-change?

RNA-Seq splicing GLM JunctionSeq
7.3 years ago
Steven Lakin ★ 1.8k

Yes. Most of these programs call lm or glm in R as a subroutine (JunctionSeq fits negative binomial GLMs), and those calls use the formula to build their models, so if a term is in the formula, it will be used when calculating the eventual fold-change. Here's why:

The formula is translated into a model matrix, and the data are fit to that matrix. If the covariate is in the model formula, the model matrix gains a new column and a coefficient is fit for the covariate as well. This affects the coefficient estimate for the main effect you wish to model, because some of the variation is now explained by the covariate's coefficient rather than attributed to the main effect (in most situations, this is exactly what you want). You can see this process for yourself by playing around with it a bit in R:

library(datasets)
data(mtcars)

# Using a single fixed effect
one.effect <- model.matrix(mpg ~ 1 + hp, data=mtcars)

one.effect
                    (Intercept)  hp
Mazda RX4                     1 110
Mazda RX4 Wag                 1 110
Datsun 710                    1  93
...
Ferrari Dino                  1 175
Maserati Bora                 1 335
Volvo 142E                    1 109
attr(,"assign")
[1] 0 1

lm(mpg ~ 1 + hp, data=mtcars)

Call:
lm(formula = mpg ~ 1 + hp, data = mtcars)

Coefficients:
(Intercept)           hp  
   30.09886     -0.06823


# Using two fixed effects (second being the "covariate", which is somewhat of a misnomer)
two.effects <- model.matrix(mpg ~ 1 + hp + wt, data=mtcars)

two.effects
                    (Intercept)  hp    wt
Mazda RX4                     1 110 2.620
Mazda RX4 Wag                 1 110 2.875
Datsun 710                    1  93 2.320
...
Ferrari Dino                  1 175 2.770
Maserati Bora                 1 335 3.570
Volvo 142E                    1 109 2.780
attr(,"assign")
[1] 0 1 2

lm(mpg ~ 1 + hp + wt, data=mtcars)

Call:
lm(formula = mpg ~ 1 + hp + wt, data = mtcars)

Coefficients:
(Intercept)           hp           wt  
   37.22727     -0.03177     -3.87783
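To connect this back to fold-changes: JunctionSeq's negative binomial GLMs use a log link, so exponentiating a coefficient (or a linear contrast of coefficients) gives the fold-change. Here is a minimal sketch with simulated counts — this is not JunctionSeq's internal code, just MASS::glm.nb on made-up data, assuming a true drug effect of 2x and a fraction (covariate) effect of 1.5x — showing that the covariate term is part of the fit and that exp(beta) recovers the drug fold-change:

```r
library(MASS)  # for glm.nb (negative binomial GLM, log link)

set.seed(1)
n <- 60
condition <- factor(rep(c("vehicle", "drug"), each = n/2),
                    levels = c("vehicle", "drug"))
fraction  <- factor(rep(c("input", "bound"), times = n/2),
                    levels = c("input", "bound"))

# Simulated truth: drug doubles counts; the "bound" fraction adds 50%
mu <- 100 * 2^(condition == "drug") * 1.5^(fraction == "bound")
counts <- rnbinom(n, mu = mu, size = 10)

fit.main <- glm.nb(counts ~ condition)             # condition only
fit.cov  <- glm.nb(counts ~ condition + fraction)  # condition + covariate

# Log link: exp(beta) is the fold-change. The drug estimate shifts once
# part of the variation is attributed to the fraction coefficient.
exp(coef(fit.main)[["conditiondrug"]])
exp(coef(fit.cov)[["conditiondrug"]])
```

Both fits should recover a drug fold-change near 2 here (the design is balanced, so condition and fraction are orthogonal), but the covariate model additionally estimates the fraction effect, which is what tightens the main-effect estimate when the fractions are unbalanced or noisy.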

EDIT: For those who are interested in the topic of regression-based modeling, I highly recommend the book "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Andrew Gelman and Jennifer Hill. It presumes only an introductory understanding of statistics, covers most practical topics in regression, and has example R/BUGS code. For those who are further interested in modern Bayesian samplers, I also recommend "Doing Bayesian Data Analysis" by John Kruschke, which focuses on R, JAGS, and Stan (a highly efficient and versatile C++-based sampler).

Thanks for your thoughtful answer! I'll experiment more with model matrices and R's glm to prove it to myself.
