Question: intersect multiple bed files in R
gravatar for tobikenobi
3.0 years ago by
tobikenobi0 wrote:

Dear all,

I would like to match (intersect) a quite big set of chip-seq peak bed files with a single bed file containing transcriptional start sites (TSS) identified by cage (cap analysis of gene expression). By using a large cohort of transcription factors (TFs), I am hoping to generate a map of the TFs associated with my promoters.

Technically, I have my promoter bed file and currently about 80 chip-seq peak bed files. I would like to look at TF binding in close upstream proximity of my promoters (for instance within 500 bp upstream of the TSS). I read that bedtools could be used for this, but I am limited to a windows laptop and I am unsure if this would work. However, I know a little but of R and would therefore prefer to use any R package that could the job.

I would be very grateful for any hints how I could start the analysis.

Thank you very much!


chip-seq R • 2.7k views
ADD COMMENTlink modified 2.9 years ago by Sirus780 • written 3.0 years ago by tobikenobi0
gravatar for Alex Reynolds
3.0 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:
  1. Windows is not a good platform for this line of work. However, if you must run Windows, to do this efficiently, install the BEDOPS toolkit in Cygwin. Or use VirtualBox and install CentOS or Ubuntu Linux in a virtual machine.

    Either way, you'll want to have some form of a Unix or Unix-like operating system on your computer.

  2. Depending on the naming scheme for your files, generate the multiset union of N peak files:

    $ bedops --everything peakA.bed peakB.bed ... peakN.bed > unioned.peaks.bed
  3. Map peaks to promoters:

    $ bedmap --echo --echo-map-id-uniq --delim '\t' promoters.bed unioned.peaks.bed > answer.bed

    This returns a list of promoters and the IDs of overlapping peaks, where there is one or more bases of overlap between peaks and promoters. If you want the entire peak, use --echo-map in place of --echo-map-id-uniq.

    Once you are familiar with how this works, steps 2 and 3 can be merged into one step:

    $ bedops --everything peakA.bed peakB.bed ... peakN.bed | bedmap --echo --echo-map-id-uniq --delim '\t' promoters.bed - > answer.bed

    This runs even faster and avoids generating intermediate files.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Alex Reynolds29k

Thank you very much for your reply.

I have two questions before I will actually try your suggestion: 1. If I understand correctly, in step 2 all chip-seq peak files are merged into one big file (unioned.peaks.bed). If I find overlap of peaks with promoters in step 3 , how will I know from which TF peak file they come from? Will the output file state it? 2. How can I set a window (lets say 500 bp ) upstream of the promoter to look for TF / promoter overlap? As my promoter file is actually a TSS file, the width of the genomic locations listed is quite narrow (1-100 bp roughly).

Again, thank you very much for your help!

ADD REPLYlink written 3.0 years ago by tobikenobi0

Pre-process your peaks files and add their names to the ID column.

$ awk -vname="1" 'BEGIN { OFS="\t"; }{ print $1, $2, $3, name; }' peakFile1.bed > peakFile1.labeled.bed

Assuming your peak files are named sensibly, you could create a loop to rename them:

$ for idx in `seq 1 80`; do awk -vname="$idx" 'BEGIN { OFS="\t"; }{ print $1, $2, $3, name; }' peakFile$idx.bed > peakFile$idx.labeled.bed; done

Then use bedmap as described, using the labeled peak files instead of the original peak files.

If you want to set a 500 bp upstream window, add --range 500 to the given bedmap command. Overlaps will be reported 500 bases up- and downstream of promoter edges.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Alex Reynolds29k

Great! Thank you very much for your help!



ADD REPLYlink written 3.0 years ago by tobikenobi0
gravatar for krushnach80
3.0 years ago by
krushnach80640 wrote:

I think you can use the bedtools intersect function ..

ADD COMMENTlink written 3.0 years ago by krushnach80640
gravatar for Sirus
2.9 years ago by
Sirus780 wrote:

You can use the findOverlapsOfPeaks function from the ChIPpeakAnno package.

Just to make the code more easy to follow, I use the readBroadPeak function from the genomation package. But you don't need to use too many packages to do that.

Suppose you have 3 bed files in the boradPeak format you can do

bed1 =  readBroadPeak("bed1.broadPeak")
bed2 =  readBroadPeak("bed2.broadPeak")
bed3 =  readBroadPeak("bed3.broadPeak")

peaksList =  list(bed1 = bed1, bed2= bed2, bed3=bed3)

ovp = findOverlapsOfPeaks(peaksList)
ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by Sirus780
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1792 users visited in the last hour