Question: glm() for correlating protein binding and gene expression?
1
gravatar for TriS
3.6 years ago by
TriS3.9k
United States, Buffalo
TriS3.9k wrote:

I am doing some analysis using RNA-Seq and ChIP-Seq data and I am playing with different ways to correlate gene expression and binding. other posts on this topic are here or here and another paper here . 

a big paper in genome biology , that I linked in my answer in the linked post, demonstrated that i.e. H3K79me2 has a relative contribution of  ~20% (0.20) to the R^2 between observed and predicted gene expression.

however, I didn't want to go through the binning of the genome etc etc...so I tested a generalized linear model. glm allows to correlate a binary outcome (0-1 or bound-not bound) to continuous data.

so...to try it out, I downloaded the H3K79me2 (and cMyc) data + K562 RNA-Seq data, annotated ChIPSeq with ChIPSeeker (+/- 1kb from TSS) and built a glm with those data, assigning 1 to bound genes and 0 to not bound genes.

the results show a max predicted probability/response of 0.21...pretty close to what the paper found and interestingly it only happens with H3K79me2 within the promoter region...not in distal intergenic, not at the 3'UTR or 5'UTR or in exons.

more interestingly, I found that cMyc gives a predicted response of ~0.92 ONLY in the promoter region, with higher binding=higher expression.

for RNASeq data, I removed 0s + log2'd and ChIPseq data were annotated using protocol described in the ChIPSeeker manual

the code I used for the glm is:

k79_names <- data from ChIPSeeker with col1=gene name, col2=ID, col3=location
x  <- as.numeric(k562) #mean expression
y <- ifelse(rownames(k562) %in% k79_names[which(k79_names[,3] == "Promoter"),1],1,0)
g_k562 <- glm(y~x, family=binomial)
summary(g_k562)
xweight <- seq(min(x), max(x),0.1)
yweight <- predict(g_k562,list(x=xweight),type="response")
max(yweight)

so...in my head those results suggest that the approach could be valid...but I'm not a statistician...so here's my question...is there any reason why I should NOT use glm() for this approach?

regression rna-seq chip-seq • 1.4k views
ADD COMMENTlink modified 3.6 years ago by Jean-Karim Heriche20k • written 3.6 years ago by TriS3.9k
2
gravatar for Jean-Karim Heriche
3.6 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche20k wrote:

The model you're using i.e. glm(y~x, family=binomial) is called logistic regression and is appropriate when the response is a binary variable as in your case. So unless I've missed something about the data, this is a perfectly valid approach.

ADD COMMENTlink written 3.6 years ago by Jean-Karim Heriche20k

you are correct, logistic regression is the right term, thanks for the correction! and thanks for looking into it :)

ADD REPLYlink written 3.6 years ago by TriS3.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1495 users visited in the last hour