Question: intersect multiple bed files in R
0
gravatar for tobikenobi
3.7 years ago by
tobikenobi0
tobikenobi0 wrote:

Dear all,

I would like to match (intersect) a quite big set of chip-seq peak bed files with a single bed file containing transcriptional start sites (TSS) identified by cage (cap analysis of gene expression). By using a large cohort of transcription factors (TFs), I am hoping to generate a map of the TFs associated with my promoters.

Technically, I have my promoter bed file and currently about 80 chip-seq peak bed files. I would like to look at TF binding in close upstream proximity of my promoters (for instance within 500 bp upstream of the TSS). I read that bedtools could be used for this, but I am limited to a windows laptop and I am unsure if this would work. However, I know a little but of R and would therefore prefer to use any R package that could the job.

I would be very grateful for any hints how I could start the analysis.

Thank you very much!

Tobias

chip-seq R • 3.7k views
ADD COMMENTlink modified 3.7 years ago by Sirus790 • written 3.7 years ago by tobikenobi0
3
gravatar for Alex Reynolds
3.7 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:
  1. Windows is not a good platform for this line of work. However, if you must run Windows, to do this efficiently, install the BEDOPS toolkit in Cygwin. Or use VirtualBox and install CentOS or Ubuntu Linux in a virtual machine.

    Either way, you'll want to have some form of a Unix or Unix-like operating system on your computer.

  2. Depending on the naming scheme for your files, generate the multiset union of N peak files:

    $ bedops --everything peakA.bed peakB.bed ... peakN.bed > unioned.peaks.bed
    
  3. Map peaks to promoters:

    $ bedmap --echo --echo-map-id-uniq --delim '\t' promoters.bed unioned.peaks.bed > answer.bed
    

    This returns a list of promoters and the IDs of overlapping peaks, where there is one or more bases of overlap between peaks and promoters. If you want the entire peak, use --echo-map in place of --echo-map-id-uniq.

    Once you are familiar with how this works, steps 2 and 3 can be merged into one step:

    $ bedops --everything peakA.bed peakB.bed ... peakN.bed | bedmap --echo --echo-map-id-uniq --delim '\t' promoters.bed - > answer.bed
    

    This runs even faster and avoids generating intermediate files.

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Alex Reynolds31k

Thank you very much for your reply.

I have two questions before I will actually try your suggestion: 1. If I understand correctly, in step 2 all chip-seq peak files are merged into one big file (unioned.peaks.bed). If I find overlap of peaks with promoters in step 3 , how will I know from which TF peak file they come from? Will the output file state it? 2. How can I set a window (lets say 500 bp ) upstream of the promoter to look for TF / promoter overlap? As my promoter file is actually a TSS file, the width of the genomic locations listed is quite narrow (1-100 bp roughly).

Again, thank you very much for your help!

ADD REPLYlink written 3.7 years ago by tobikenobi0

Pre-process your peaks files and add their names to the ID column.

$ awk -vname="1" 'BEGIN { OFS="\t"; }{ print $1, $2, $3, name; }' peakFile1.bed > peakFile1.labeled.bed

Assuming your peak files are named sensibly, you could create a loop to rename them:

$ for idx in `seq 1 80`; do awk -vname="$idx" 'BEGIN { OFS="\t"; }{ print $1, $2, $3, name; }' peakFile$idx.bed > peakFile$idx.labeled.bed; done

Then use bedmap as described, using the labeled peak files instead of the original peak files.

If you want to set a 500 bp upstream window, add --range 500 to the given bedmap command. Overlaps will be reported 500 bases up- and downstream of promoter edges.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Alex Reynolds31k

Great! Thank you very much for your help!

Best,

Tobias

ADD REPLYlink written 3.7 years ago by tobikenobi0
0
gravatar for krushnach80
3.7 years ago by
krushnach80850
krushnach80850 wrote:

I think you can use the bedtools intersect function ..

ADD COMMENTlink written 3.7 years ago by krushnach80850
0
gravatar for Sirus
3.7 years ago by
Sirus790
Boston/USA
Sirus790 wrote:

You can use the findOverlapsOfPeaks function from the ChIPpeakAnno package.

Just to make the code more easy to follow, I use the readBroadPeak function from the genomation package. But you don't need to use too many packages to do that.

Suppose you have 3 bed files in the boradPeak format you can do

bed1 =  readBroadPeak("bed1.broadPeak")
bed2 =  readBroadPeak("bed2.broadPeak")
bed3 =  readBroadPeak("bed3.broadPeak")

peaksList =  list(bed1 = bed1, bed2= bed2, bed3=bed3)

ovp = findOverlapsOfPeaks(peaksList)
head(ovp$peaklist[["bed1///bed2///bed3"]])
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Sirus790
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1276 users visited in the last hour