Question

Pairwise Analysis - I want to look at overlay between Pol II peaks and H3K4me1/H3K27Ac at intergenic regions

2

Entering edit mode

8.5 years ago

Vov ▴ 70

I am new to Bioinformatics, with minimal experience in Terminal or scripting languages. There is a large chance that I will not be able to convey what I want to accomplish, please ask questions so that I may clarify.

Statement of Problem: I have run ChIP-Seq experiments on several enhancers and protein markers (I will refer to all of these as marks). I have already gone ahead and aligned the data using Bowtie, and then called peaks using MACS2. As a follow-up to a previous paper published by another lab member I want to find the co-occupancy of three of these marks in relation to the transcription start site (TSS) to determine how many of these peaks overlap and then create a venn diagram to show these overlaps / intersections (I am not sure on the terminology). The problem is I am not exactly sure how to do this and after almost 8+ hours of attempts I have gotten absolutely no-where, though I will post my attempts.

For example, from the Pol II peaks I have identified at TSS or intergenic regions, now I would like to define overlays with other factors/marks. MY PEAK FILES ARE IN .BED FORMAT.

I want to look at overlay between Pol II peaks and H3K4me1/H3K27Ac at intergenic regions.

How do I accomplish my goals?

Previous Study Information: As mentioned earlier an ex-lab member wrote a paper and was able to achieve this. He left behind a python script which I will copy down in a bit, but I can't seem to get it to work, nor do I really want to. I don't want to blindly trust the output of the script without knowing how and what it is doing. The image below is from the ex-lab member's paper and is basically what I would like to accomplish:

I was able to run the Python script the ex-lab member used to generate these venn diagrams for Polymerase II only, and from what I was able to understand the script basically outputs a "total # of intersections" line that he then plugged into somewhere in order to generate these diagrams.

I have uploaded the script to github so that you may view it here: https://github.com/Vov-Bio/findcooccupancy.py/blob/master/Script

What I Have Done: I want to provide a bit of what I have done so far, not only to show that I have attempted to solve the problem, but so that someone can steer me in the right direction.

First thing I did was research and see if I could find any information on how to overlap my data. The only solid topic I could find was here: How to determine genome wide Co-occupancy between two transcription factors?

In this post someone someone says "you can make venn diagram to show common and unique peaks of two transcription factors and could also make a line plot around TSS showing average occupancy of both" using https://usegalaxy.org/

This defines what I want to do almost perfectly.

Other searches brought up two interesting options:

BEDtools
Homer

I attempted to use BEDtools multiintersectBED function but could not find the peak intersection counts that I wanted.

I then attempted to use Homer mergePeaks function to find peak intersection counts and got 'something', but I am almost positive it is incorrect.

Essentially I ran the following function:

mergePeaks -d 250 -matrix -ghist ~/data/TSSforAllAnnotatedGenes.txt ~/data/Pol-II-Chip.MACS2_summits.bed

and received

/User/data/TSSforAllAnnotatedGenes.txt    0    13062
/User/data/Pol-II-Chip.MACS2_summits.bed    11608    0

I assumed that 11,608 would be my Pol-II number for my venn diagram, but after attempting to run the same command for other peak files I am not so sure.

Any help is appreciated.

ChIP-Seq cooccupancy homer • 3.7k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.5 years ago by Vov ▴ 70

0

Entering edit mode

See if this online course on Bioconductor from Kasper Daniel Hansen can give you some idea on how to do the intersection. In particular check the AnnotationHub lecture for retrieving TSS and the GenomicRanges for the intersections. The video 12 and 13 of week 1 show a similar case than what you describe, at least as a start.

ADD REPLY • link 8.5 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 1 · 2015-10-13

1

Entering edit mode

8.5 years ago

Alternative ▴ 270

Below is a simple python script that I wrote in order to do that. Needs some optimizations but it works.

Inspired by: https://pythonhosted.org/pybedtools/3-brief-examples.html and http://fouryears.eu/2012/10/13/venn-diagrams-in-python/

Hope this helps. Feel free to optimize it and share it

#!/usr/bin/env python

## R  Vennerable is very good but cannot handle intervals, like bed files. Here, we can do that with a combination of pybedtools,  matplotlib and matplotlib_venn

## TODO: pass all information as arguments (bed files, output file name, labels, plot title)

from pybedtools import BedTool, cleanup
from matplotlib import pyplot as plt
import numpy as np
from matplotlib_venn import venn3, venn3_circles
import sys

# outputfile = sys.argv[1]
# print "outputfile:", outputfile

outputfile  = "file.pdf"
plot_title  = "Overlapping (common) peaks"
plot_labels = ('a', 'b', 'c')

a = BedTool('a.bed')
b = BedTool('b.bed')
c = BedTool('c.bed')

n1   = (a-b-c).count() # unique to a
n2   = (b-a-c).count() # unique to b
n3   = (c-a-b).count() # unique to c
n12  = (a+b-c).count() # in a and b, not c
n13  = (a+c-b).count() # in a and c, not b
n23  = (b+c-a).count() # in b and c, not a
n123 = (a+b+c).count() # common to all

plt.figure(figsize=(3,3))

plt.rcParams['font.size'] = 8

## order of numbers: n1, n2, n12, n3, n13, n23, n123

v1 = venn3(subsets=(n1, n2, n12, n3, n13, n23, n123), set_labels = plot_labels)
plt.title(plot_title)
plt.savefig(outputfile)

# Cleaning temporary files
cleanup()

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by Alternative ▴ 270

0

Entering edit mode

Hey there, I love this quick script. Was able to get it up and running however I think these script only picks out peaks that overlap perfectly among all three files. I am hoping to find overlaps within 250 bp (125 upstream, 125 downstream). Any idea how I could code this into your script or is it even possible?

Also is there a way to do this so that it overlaps these peaks with the TSS of all genes? Lets assume I have a text document with the TSS for all annotated genes, and would like to find the overlapping peaks within 250 bp (125 up, 125 down) of three other bed files in relation to that, but I only want to show counts for those three bed files.

Hopefully it made sense.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by Vov ▴ 70

0

Entering edit mode

Hi,

how can I use your script for 2 or more than 3 bed files?

best

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 7.7 years ago by chamseddine.kifagi • 0

0

Entering edit mode

With BED files you should use BedTools or BedOps. There should be a valid option to do whatever you are trying to do with BED files.

ADD REPLY • link 7.7 years ago by GenoMax 141k

Ram · Answer 2 · 2015-10-13

I find Seqmonk to be a very helpful beginner tool in Bioinformatics.

This is an interactive genome browser which you can customise to your needs. Create a new file with the appropriate reference genome. Import each of your peak files as an Annotation file. File > ImportAnnotation > Text (Generic)

You might also want to import a data file. File > ImportData > FileType. At this point, it's not very important what the data is.

Now define your probes using one of your annotation/peak files e.g. Pol II > Data > Define Probes > Feature Probe Generator > Features to design around (Select from drop down list). Create Probes.

Read Count Quantitation.

Now you can start filtering for overlaps with other regions.

> Filter > Filter by features > overlapped by (mark of interest).

A filtered peak list will appear on the left hand menu. You might need to expand the views.

Continue sequential filtering to overlap with all the features you're interested in.

Just have a play with Seqmonk. It will do most of the things you need right now and it's very easy to learn.