I am doing some analysis using RNA-Seq and ChIP-Seq data and I am playing with different ways to correlate gene expression and binding. other posts on this topic are here or here and another paper here
A big paper in genome biology, that I linked in my answer in the linked post, demonstrated that i.e. H3K79me2 has a relative contribution of ~20% (0.20) to the R^2 between observed and predicted gene expression.
However, I didn't want to go through the binning of the genome etc etc...so I tested a generalized linear model. glm allows to correlate a binary outcome (0-1 or bound-not bound) to continuous data.
So, to try it out, I downloaded the H3K79me2 (and cMyc) data + K562 RNA-Seq data, annotated ChIPSeq with ChIPSeeker (+/- 1kb from TSS) and built a glm with those data, assigning 1 to bound genes and 0 to not bound genes.
The results show a max predicted probability/response of 0.21...pretty close to what the paper found and interestingly it only happens with H3K79me2 within the promoter region...not in distal intergenic, not at the 3'UTR or 5'UTR or in exons.
More interestingly, I found that cMyc gives a predicted response of ~0.92 ONLY in the promoter region, with higher binding=higher expression.
For RNASeq data, I removed 0s + log2'd and ChIPseq data were annotated using protocol described in the ChIPSeeker manual
The code I used for the glm is:
k79_names <- data from ChIPSeeker with col1=gene name, col2=ID, col3=location
x <- as.numeric(k562) #mean expression
y <- ifelse(rownames(k562) %in% k79_names[which(k79_names[,3] == "Promoter"),1],1,0)
g_k562 <- glm(y~x, family=binomial)
summary(g_k562)
xweight <- seq(min(x), max(x),0.1)
yweight <- predict(g_k562,list(x=xweight),type="response")
max(yweight)
So, in my head those results suggest that the approach could be valid, but I'm not a statistician. So here's my question: Is there any reason why I should NOT use glm()
for this approach?
you are correct, logistic regression is the right term, thanks for the correction! and thanks for looking into it :)