Question

[edgeR Usage] How does edgeR handle multivariate gene expression analysis

2

Entering edit mode

7.1 years ago

moxu ▴ 510

Sorry I asked this question in replying to a reply to my post on Biostars (multivariate analysis of RNA-seq) this morning, but I feel it necessary to pull it up so that because it's more specific than a generic multivariate regression.

In edgeR, given the following data table:

GENE EXPRESSION DISEASE A B
1 A1BG  0.4785665       1   0   0
2 A1BG -2.0000000       1   0   0
...
610683 ZZZ3   1.903144       0   0   1
610684 ZZZ3   1.959089       0   0   1

A: concentration of compound A (continuous var)

B: concentration of compound B (continuous var)

DISEASE: 0/1 whether it's a disease or normal sample

If what I care is to do the following "glm":

EXPRESSION ~ intercept + DISEASE + A + B

Then,

1) How should I define "group" in edgeR? DISEASE can be grouped, but A or B cannot because they are continuous

2) coef = 3?

3) Should I use contrast of c(0, 1, 1, 1)?

In effect, I don't know much about "coef = x" or why one needs to "group" or specify "contrast". If we can assume EXPRESSION or log(EXPRESSION) is normally distributed, then in R we can simply do

glm(EXPRESSION ~ DISEASE + A + B)

Don't know why edgeR is so (unnecessarily?) complicated.

It would be much appreciated if someone could post the actual edgeR code, or just enlighten me with some teaching.

Thanks much!

RNA-Seq R next-gen • 3.8k views

ADD COMMENT • link 7.1 years ago by moxu ▴ 510

score 0 · Answer 1 · 2017-03-28

0

Entering edit mode

7.1 years ago

Devon Ryan 104k

You don't need binary groups, covariates are fine.
Depends on what you want to test. That's the whole point of being able to select the coef you want (equivalent to what glm() is doing) or a contrast (i.e., comparing two or more coefficients).
The values in your contrast vector should sum to 0. Given your design I don't think a contrast would make sense, but perhaps my guess is wrong.

What edgeR is doing is a fancier version of what glm is doing. You might want to collaborate with a local statistician or bioinformatician.

ADD COMMENT • link 7.1 years ago by Devon Ryan 104k

0

Entering edit mode

Unfortunately, I am supposed to be the local "statistician" or "bioinformatician".

Could you please outline the edgeR code? Does not have to be exact, just the major points.

Thanks much!

ADD REPLY • link 7.1 years ago by moxu ▴ 510

0

Entering edit mode

What's the biological question you want to ask?

ADD REPLY • link 7.1 years ago by Devon Ryan 104k

0

Entering edit mode

Which factors (disease, compound A, B) have impact on which gene expression? The disease or normal samples have been treated with combinations of varying concentrations of A & B.

ADD REPLY • link 7.1 years ago by moxu ▴ 510

0

Entering edit mode

Then you'll need a separate adjusted p-value for each of the coefficients (2-4).

ADD REPLY • link 7.1 years ago by Devon Ryan 104k

0

Entering edit mode

Brief R code, please? I really cannot figure out heads or tails out of the "group", "chef", "design" stuff or edgeR.

Thanks much!

ADD REPLY • link 7.1 years ago by moxu ▴ 510

0

Entering edit mode

The design you want is presumably ~DISEASE + A + B, in which case coef=2 would be disease, coef=3 would mean A, and coef=4 would be B.

ADD REPLY • link 7.1 years ago by Devon Ryan 104k

score 0 · Answer 2 · 2017-03-29

0

Entering edit mode

7.1 years ago

moxu ▴ 510

Very helpful. However, I am not there yet:

d<-read.table("myfile.txt", sep="\t", header=T);
sd<-subset(d, GENE=="A1BG");
attach(sd)
dsn<-model.matrix(~DISEASE+A+B, data=sd);
fit<-glmFit(EXPRESSION, dsn, coef=c(2,3,4))

And I got the following error:

Error in glmFit.default(EXPRESSION, dsn, coef = c(2, 3, 4)) : 

  Design matrix not of full rank.  The following coefficients not estimable:

 A B

coef=2 yielded the same error.

Thanks!

ADD COMMENT • link 7.1 years ago by moxu ▴ 510

0

Entering edit mode

You'll have to inspect your model matrix to see why it's not full rank. Nothing can get around that and you'd get the same error with glm() and all other RNAseq packages.

ADD REPLY • link 7.1 years ago by Devon Ryan 104k

0

Entering edit mode

You are right -- I used the wrong data file, although all the data files share the same format. With the correct data file, the error becomes:

> fit<-glmFit(EXPRESSION, dsn, coef=c(2,3,4))
Error in glmFit.default(EXPRESSION, dsn, coef = c(2, 3, 4)) : 
  No dispersion values provided.

Dispersion? I read the manual and it seems to come from factors, but A & B are continuous, and DISEASE although 0/1 binary can also be regarded as continuous. So how can I get the dispersion?

Feel that I am getting very close! :)

Thanks!

ADD REPLY • link 7.1 years ago by moxu ▴ 510

0

Entering edit mode

Please read the egeR user guide, you're missing a few steps.

ADD REPLY • link 7.1 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for the suggestion. Spent a couple of days reading and exercising the user guide. Now my questions are rephrased and posted at [edgeR usage] Comparing categories of fewer input variables for different gene expression.

ADD REPLY • link 7.1 years ago by moxu ▴ 510