Correlate ChIP-seq to gene expression
1
0
Entering edit mode
15 months ago

I'm looking for a way to compare ChIP-seq signal to gene expression, and not 100% sure the best way to do this. To be specific, the end goal is to make a scatter plot with (normalized) RNA-seq counts on the X-axis and (normalized) ChIP-seq on the Y, and calculate a correlation coefficient.

I have a matrix of counts (FeatureCounts) from a large RNA-seq experiment, but I don't have access to the raw files. I also have BAM and peak files from several ChIP-seq experiments that I've done (in the same cell type). I'd like to see if the strength of the ChIP-seq signal within genes (as in the coverage in the BAM file or the peak score from the peak file), or possibly within X bp of the start of a gene, correlates to the RNA expression value of that gene. I can obviously normalize the RNA counts prior to comparing to ChIP-seq, but this is where I get lost, I'm not sure how I would do this analysis. One thing I had thought about was running the ChIP BAM files through FeatureCounts as well, but I'm not sure if that's the optimal way to do it, and even from there I'm a little confused about what I would do with that final matrix. Would it be more ideal if I could obtain the RNA BAM files?

I often use DeepTools when I want to find the correlation between two samples, but that's usually more straight-forward because I just use BAMs/bigwigs over bins or a consensus peakset, but using Count data is getting me confused! Thanks in advance for any suggestions/recommendations!

1
Entering edit mode

A lot of people associate a ChIP-seq peak with it's nearest gene. Obviously there are some major caveats to this (e.g. strand is completely ignored, internal genes like small RNA can be ignored, etc...). But why not try that? You have the position of your peaks and you can set thresholds like "only associate with nearest gene if it is 0bp (within a gene) or within 5kb, 10kb, 100k away etc... from the peak". Then just plot the read counts for the genes that have a peak associated with them and use the enrichment peak score (or whatever the delta is over the input).

Definitely a ton of room for error but you should go in and double check interesting/confusing things.

0
Entering edit mode

Thanks for your reply! So if I'm understanding what you mean, I think I've done that kind of analysis before. Annotated my ChIP-seq peaks with different thresholds (e.g. 300 bp from TSS, 1000bp from TSS etc), but where I get a little confused is how to combine the ChIP data with the RNA counts. If I had the RNA BAM files, what I might do is make a SAF of the ChIP data and then run FeatureCounts for the RNA over those regions, but I only have the RNA count matrix (and ideally I'd be able to correlate ChIP peak score with RNA FPKM or TMM, for example). Is there by any chance a program you know of that can do this, or an online tutorial of some sort? I'm imagining DeepTools won't work with a count matrix...

1
Entering edit mode

"how to combine the ChIP data with the RNA counts" you already have the gene expression, Now calculate the peak "expression" i.e signal from the Chip-Seq. Then associate peaks to genes. Then plot the 'scatterplot' of FPKM vs the normalized chip-seq signal on the gene associated peak. Other option is to make quantiles from gene expression and plot the chip-seq signal for the peaks associated to those genes in each bin. So it will be box-plots of chip-seq signals across quantiles of gene expression. If there is a correlation, it will be seen i.e genes in high expression bins get higher chip-seq signals and vice-versa.

0
Entering edit mode

Thanks, this makes sense. So to be clear, then do you mean I would use FeatureCounts on my ChIP-seq data so that I can get the coverage across genes? I can't think of another way to do it, since I imagine the scatterplot would require the same genomic ranges/genes for the X and Y axes, is that correct? I think I'm starting to understand this conceptually, but still a little lost about how to implement it in code...

0
Entering edit mode

Use FeatureCounts on your peaks. FeatureCounts works with SAF files, not only GTF files. MACS2 --> call peaks --> convert bed to SAF --> FeatureCounts.

Then ANNOTATE peaks to gene.

Peaks.bed
chr1    100    350
chr9    9889    9999
chr10    87876    8920


ANNOTATE the peaks

chr1    100    350    Gene1
chr9    9889    9999    NA
chr10    87876    8920    Gene2


Now you have two quantifications:

GegeID    FPKM    Promoter_ChiP
Gene1    10.9    32
Gene2    9.1    44


Seems like you are confused about annotating peaks with genes. Read about it.

You could also use FeatureCounts on genes (gene start + 2kb upstream, gene end coordinates), but that would include counting background reads if you dont know where the peaks (real signal) are. This is not recommended and probably doesn't work if you don't do it in correct way.

0
Entering edit mode

Ohhh, yeah annotation totally makes sense here, thank you! This completely clarifies it!!

0
Entering edit mode
15 months ago

This might not sound a straight answer to your question but I presume it will help in some way. You can first group your Chipseq peaks and Associated Gene Signals using InTAD package in R that groups the Peaks and Genes in Topologically Associated Domains. And then Calculate correlation between Chip and RNAseq signals to decipher possible interactions. It also furnishes the correlation coefficient and P values based on the type of correlation used. You can find this tutorial helpful.

I hope this helps.