Which tool can plot the enrichment of specified motif across ChIP-seq bed files?
1
2
Entering edit mode
9.0 years ago
xiaoyonf ▴ 60

Hi, I have two ChIP-seq bed files of same transcription factor. I want to compare one specific binding motif between these two ChIP-seq binding peaks, i.e. whether the motif is more enriched in the binding sites of one ChIP-seq vs. the other (under different culture condition)?

The plots I want to make is by X axis: "Distance from centre of ChIP region (bp)" and the Y axis: " No. motif / no. binding sites"

Thank you for your answers!

Xiaoyong

ChIP-Seq • 6.2k views
ADD COMMENT
1
Entering edit mode

R will plot signal. You just need to specify what signal you want to plot.

ADD REPLY
0
Entering edit mode

Thanks, Alex. My question is actually is which software can do the motif enrichment analysis for my purpose. i.e. estimate how many specific motifs per binding site surrounding the center of peak region (-500bp ~ 500bp). Then plot it to visualize the difference between different ChIP-seq data sets.

ADD REPLY
1
Entering edit mode

You could use FIMO to call binding sites, running the BED output through BEDOPS bedmap to count sites over 1kbase windows of interest. Then use R to plot.

ADD REPLY
0
Entering edit mode

Thanks, Alex. Now I have two bed files, one is the FIMO output bed file defining the specific TF binding regions; other one is the ChIP-seq peak bed file containing the intervals (~300bp) that is used for calculation -- how many binding motifs (also location) within each interval. I want to plot a histogram showing the distribution of the numbers of motifs across a 1Kb window of these centered peak intervals. According to your suggestion, could you write me a script that runs in bedops and R. Many thanks! -Xiaoyong

ADD REPLY
1
Entering edit mode

I'll do the simplest possible example for counting:

$ bedmap --count peaks.bed bindingSites.bed > numberOfBindingSitesAcrossPeaks.txt

This simply counts binding sites that overlap peaks. Note that overlapping peaks may result in double counts. (Peaks are not 1kb windows; I'm just showing the simplest use of the --count operator.)

To make a simple histogram, I think this would suffice (untested):

$ R
> t <- read.table("numberOfBindingSitesAcrossPeaks.txt", header=F)
> hist(t)

You could read ?hist in R for more details.

If you want a 1kb window around midpoints, you would calculate the midpoints of the peaks (say, with awk) and pipe that result into bedmap:

$ awk '{ midpoint = $2 + ($3-$2)/2; print $1"\t"midpoint"\t"(midpoint+1)}' peaks.bed | bedmap --count --range 500 - bindingSites.bed > numberOfBindingSitesAcross1kbWindowPeakMidpoints.txt
  1. Use - in bedmap to represent the standard output from awk (the peak midpoints).

  2. Add --range 500 to bedmap --count to generate a 1kb window around peak midpoints (500 bases up- and downstream of the midpoints). Counts are generated from the 1kb windows.

  3. Then make the histogram in R with the new counts and hist()

Note that overlapping peak windows may result in double counts.

To get the location of binding sites that overlap the midpoints, use --echo-map instead of --count, change the map delimiter to a newline, and pipe to sort-bed:

$ awk '{ midpoint = $2 + ($3-$2)/2; print $1"\t"midpoint"\t"(midpoint+1)}' peaks.bed | bedmap --echo-map --range 500 --multidelim '\n' - bindingSites.bed | sort-bed - > bindingSitesAcross1kbWindowPeakMidpoints.bed

You could bring this file into a UCSC Genome Browser instance, IGB, or other viewer.

ADD REPLY
0
Entering edit mode

Thank you for your prompt reply. Your commands work very well. However, I am sorry that I didn't express my question well. The plot I want to make is: X-axis is the distance to peak summit (i.e., -500 to 500bp), and the Y-axis is the "number of motifs per number of peaks". In that way, we can appreciate the change of motif density across the distance to the ChIP-seq summit in a 1kb window. Thank you! -Xiaoyong

ADD REPLY
1
Entering edit mode

Then you need to make a set of bins and count across them.

The following uses bedops --chop 10 to make 10-base wide bins across a 1kb-wide window, which is centered on each peak midpoint. You can change how large or how small the bins are by adjusting the --chop value.

The bins are each fed to bedmap --count to count the number of binding sites that overlap the bin:

$ awk '{ midpoint = $2 + ($3-$2)/2; print $1"\t"midpoint"\t"midpoint;}' peaks.bed > midpoints.txt
$ while read MIDPOINT; do echo -e "$MIDPOINT" | bedops --range 500 --chop 10 - | bedmap --echo --count --delim '\t' - bindingSites.bed | awk 'BEGIN{idx=0;}{print $0"\tbin-"idx++; }'; done < midpoints.txt > countsOfBindingSitesOverBins.bed

Each 1kb window will contain 100 10-base bins, so the file countsOfBindingSitesOverBins.bed will look something like:

chr1    100 110 0   bin-0
chr1    110 120 1   bin-1
chr1    120 130 0   bin-2
chr1    130 140 2   bin-3
...
chr1    1070    1080    0   bin-97
chr1    1080    1090    1   bin-98
chr1    1090    1100    1   bin-99
chr1    3300    3310    0   bin-0
chr1    3310    3320    0   bin-1
chr1    3320    3330    3   bin-2   
chr1    3330    3340    0   bin-3
chr1    3340    3350    2   bin-4
...

You have the coordinates of the bins in the first three columns, the count of binding sites across the bin in the fourth column, and the bin ID in the fifth column.

Each 1kb window will contain 100 10-base bins, so this output will have IDs that look like bin-0 through bin-99. A bin containing a peak midpoint should be labeled bin-50, in this case.

It's unlikely you care about the coordinates of the bins, so you can use cut to grab the fourth and fifth columns:

$ cut -f4,5 countsOfBindingSitesOverBins.bed > countsPerBin.txt

From here, you can use read.table() in R to read in the text file countsPerBin.txt. You can make a bar plot with ggplot2 or similar:

$ R
> t <- read.table("countsPerBin.txt", header=F, stringsAsFactors=T)
> names(t) <- c("count", "id")
> library(plyr)
> tt <- ddply(t, "id", summarize, mcount = mean(count))
> library(ggplot2)
> ggplot(tt, aes(x=factor(id), y=mcount)) + geom_bar(stat = "identity")

The x-axis shows the bin ID value and the y-axis shows the mean count of binding sites that overlap that bin.

ADD REPLY
0
Entering edit mode

Thank you so much for this comprehensive instruction. I will try and let you know if I have more questions. Best, -Xiaoyong

ADD REPLY
5
Entering edit mode
9.0 years ago
vivekbhr ▴ 690

You can use MEME suite for enrichment of a specific motif over the given background (http://meme.nbcr.net/meme/tools/ame).. To plot the enrichment with distance, you will have to bin your input sequences into categories (1kb, 2kb, 3kb from TSS etc.), compute enrichment over control and then plot it in R with bins on X and enrichment score on Y axis..

Hope this helps.

ADD COMMENT
0
Entering edit mode

Thank you, Vivek. I will try ...

ADD REPLY

Login before adding your answer.

Traffic: 2228 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6