Is there any way to query all human chip-seq data for transcription factor binding to a specific gene promoter. I would like to create a map of TFs that bind to a specific chromosomal region corresponding to the promoter of my gene of interest. For example, chip seq has been done for the VDR, STAT, CREB etc. How do I determine if VDR or these other factors are bound to the promoter of my gene of interest from chip seq data. I would like to use in silico cis-element predictors along with Chip-seq data to help us narrow functional TF-binding in the promoter. We would then confirm these sites with our own Chip analysis in the mouse models.
This may be overkill since you're only interested in one gene. If you have a bunch of BED files, one for each ChIP-seq experiment, you could put the appropriate TF name in the 4th column of a file, and use bedmap to map this information over to your region of interest. Example:
// in creb.bed chr1 4012 4103 CREB chr1 5500 5750 CREB ... chrY 3100000 3100050 CREB // in stat1.bed chr1 500 625 STAT1 ... // similar files, one for every other TF
Define your region of interest in a BED file. Say you are looking at the CTCF gene.
// in myfave.bed chr16 67596110 67596310 CTCF +
bedops -u creb1.bed stat1.bed ... | bedmap --echo --echo-map-id myfave.bed - > answer.bed
where you pass all/any number of TF files to the first bedops command. The bedmap command tells you which TFs are in your region of interest. If you only want distinct TFs (so that, for example, STAT1 doesn't show up 5 times), replace --echo-map-id with --echo-map-id-uniq. The output file would look something like:
// in answer.bed chr16 67596310 67673088 CTCF +|STAT1;VDR
where the stuff after the '|' symbol includes the TF information you want (all separated by semicolons). Keep in mind that you can put as many rows as you want in myfave.bed so this approach scales as needed. The only other requirement is that all of these BED files be properly sorted.
The bedmap program has other useful things it can report (all in one pass of the data) - perhaps you want the binding sites themselves and not just the IDs (see --echo-map). There are also multiple ways to specify what it means for an element to lay within your region of interest (ie; should a peak call have at least 50% of its genomic length lying inside what you define to be the promoter region? See --fraction-map and related overlap constraint options).