Setup

Using bioconda, you can install everything you need for this example with

conda install --channel bioconda pybedtools bedtools htslib matplotlib

Run

Run get-data.sh to download data from ENCODE.

bash get-data.sh

Input files have the following format (UCSC broadPeak and narrowPeak formats, which ar variants of BED format):

chr1    569797  570055  .       1000    .       38.118451       16.0    -1
chr1    724125  2647713 .       258     .       1.259053        11.2    -1
chr1    752542  752779  .       658     .       10.178273       1.9     -1

Then run binary_heatmaps.py to generate the plot, a summary file, and a directory of interval files for each class.

python binary_heatmaps.py

Output

`binary_heatmap.png`

Rows are genomic intervals (as output by bedtools multiinter); columns are input BED files; black indicates that factor was found in that genomic interval.

`class_counts.txt`

Summary of how many genomic intervals for each combinatorial class:

             LSD1: 16181
        LSD1,TAL1: 15120
             TAL1: 7989
  GATA1,LSD1,TAL1: 3009
       GATA1,LSD1: 654
            GATA1: 231
       GATA1,TAL1: 214

`intervals/*.bed`

For each of the above classes, a BED file of the indicated intervals. For example,

track name="LSD1_and_TAL1"
chr1    778211  778487
chr1    854053  854329
chr1    948500  948776
...

	import os
	from matplotlib import pyplot as plt
	from pybedtools.contrib import plotting
	import pybedtools
	import numpy as np


	# set up the order in which to plot the columns of the binary heatmap
	names, bts = zip(*[
	('GATA1', 'wgEncodeAwgTfbsSydhK562Gata1UcdUniPk.narrowPeak.gz'),
	('LSD1', 'wgEncodeBroadHistoneK562Lsd1Pk.broadPeak.gz'),
	('TAL1', 'wgEncodeAwgTfbsSydhK562Tal1sc12984IggmusUniPk.narrowPeak.gz'),
	])
	bts = [pybedtools.BedTool(i).sort() for i in bts]

	# set up the object by giving it a list of pybedtool.BedTool objects and a list
	# of names to use.
	b = plotting.BinaryHeatmap(
	bts=bts,
	names=names)

	# plot it
	b.plot()

	# write out how many genomic location of each class were identified
	with open('class_counts.txt', 'w') as fout:
	for cls, cnt in sorted(b.class_counts.items(), key=lambda x: x[1], reverse=True):
	fout.write('{0:>25}: {1:<15}\n'.format(cls, cnt))

	# write out the actual intervals from each class
	out_dir = 'intervals'
	if not os.path.exists(out_dir):
	os.makedirs(out_dir)

	for k, v in b.classified_intervals.items():
	label = k.replace(',', '_and_')
	v.cut([0, 1, 2]).saveas(os.path.join(out_dir, label + '.bed'), trackline='track name="%s"' % label)

	# save the figure
	fig = plt.gcf()
	fig.tight_layout()
	fig.savefig('binary_heatmap.png')
	plt.show()

	#!/bin/bash
	wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsSydhK562Gata1UcdUniPk.narrowPeak.gz

	wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneK562Lsd1Pk.broadPeak.gz

	wget --no-clobber http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsSydhK562Tal1sc12984IggmusUniPk.narrowPeak.gz