Sorry, I already asked this question in a reply on my Biostars post (multivariate analysis of RNA-seq) this morning, but I feel it is necessary to post it separately because it's more specific than a generic multivariate regression.
In edgeR, given the following data table:
        GENE  EXPRESSION  DISEASE  A  B
1       A1BG   0.4785665        1  0  0
2       A1BG  -2.0000000        1  0  0
...
610683  ZZZ3   1.903144         0  0  1
610684  ZZZ3   1.959089         0  0  1
A: concentration of compound A (continuous var)
B: concentration of compound B (continuous var)
DISEASE: 0/1 whether it's a disease or normal sample
If what I care about is fitting the following "glm":
EXPRESSION ~ intercept + DISEASE + A + B
Then,
1) How should I define "group" in edgeR? DISEASE can be grouped, but A or B cannot because they are continuous
2) coef = 3?
3) Should I use contrast of c(0, 1, 1, 1)?
In effect, I don't know much about "coef = x" or why one needs to "group" or specify "contrast". If we can assume EXPRESSION or log(EXPRESSION) is normally distributed, then in R we can simply do
glm(EXPRESSION ~ DISEASE + A + B)
I don't know why edgeR is so (unnecessarily?) complicated.
It would be much appreciated if someone could post the actual edgeR code, or just enlighten me with some teaching.
Thanks much!
Unfortunately, I am supposed to be the local "statistician" or "bioinformatician".
Could you please outline the edgeR code? Does not have to be exact, just the major points.
Thanks much!
What's the biological question you want to ask?
Which factors (disease, compound A, compound B) affect the expression of which genes? The disease and normal samples have been treated with combinations of varying concentrations of A & B.
Then you'll need a separate adjusted p-value for each of the coefficients (2-4).
Brief R code, please? I really cannot make heads or tails of the "group", "coef", "design" stuff in edgeR.
Thanks much!
The design you want is presumably ~DISEASE + A + B, in which case coef=2 would be disease, coef=3 would be A, and coef=4 would be B.
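To make the coefficient numbering concrete, here is a minimal sketch of the usual edgeR GLM workflow for that design. It rests on assumptions that are not in the thread: edgeR models raw counts, so the sketch uses a gene-by-sample count matrix (the hypothetical `counts`, simulated here only so the code runs) and a one-row-per-sample table (the hypothetical `samples`) instead of the long-format EXPRESSION table shown in the question. The choice of the quasi-likelihood functions (`glmQLFit`/`glmQLFTest`) is also mine. With a design matrix there is no need to define "group", and each factor gets its own test through `coef` rather than a combined contrast such as c(0, 1, 1, 1), which would test the sum of the three coefficients.

```r
library(edgeR)

# Hypothetical toy inputs, simulated only so the sketch is runnable.
# Replace with your real raw count matrix and per-sample covariates.
set.seed(1)
n_genes <- 200
n_samples <- 12
samples <- data.frame(
  DISEASE = rep(c(0, 1), each = 6),            # 0 = normal, 1 = disease
  A       = rep(c(0, 0.5, 1), times = 4),      # concentration of compound A
  B       = rep(c(0, 0, 0, 1, 1, 1), times = 2) # concentration of compound B
)
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = 100, size = 10),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("s", seq_len(n_samples)))
)

# Design matrix: columns are (Intercept), DISEASE, A, B,
# so coef = 2 is DISEASE, coef = 3 is A, coef = 4 is B.
design <- model.matrix(~ DISEASE + A + B, data = samples)

y <- DGEList(counts = counts)   # no "group" argument needed here
y <- calcNormFactors(y)         # TMM normalization factors
y <- estimateDisp(y, design)    # dispersions estimated under the full design
fit <- glmQLFit(y, design)

# One test, and one set of adjusted p-values (FDR column), per coefficient.
res_disease <- topTags(glmQLFTest(fit, coef = 2), n = Inf)
res_A       <- topTags(glmQLFTest(fit, coef = 3), n = Inf)
res_B       <- topTags(glmQLFTest(fit, coef = 4), n = Inf)

head(res_A$table)
```

`coef` also accepts a column name of the design matrix (for example `coef = "A"`), which is less error-prone than counting columns by hand.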