I am looking at ChIP-seq data by comparing read counts in promoter regions of genes. However, I wonder if there is a better way to define a promoter region than just assuming that the 5-prime end of each UCSC known gene annotation is the TSS and considering a radius of 1 or 2 kb around that coordinate.
Why don't you like the easy solution?
1. Gene models are incomplete, i.e. genes are missing (no readout).
2. The 5' end is not mapped correctly (incorrect readout).
3. You are looking at a specific tissue where you know certain annotated genes are not expressed or certain promoters are not used.
4. You don't like the generic 1 kb/2 kb window.
For a global picture, most of the available annotation for protein-coding genes is sufficient, so the first two points above are rarely a real problem. In the third case you still want to know about the chromatin patterns at genes that are not expressed in your tissue. If you want to make your life complicated, you could try looking at CAGE data and mRNA (Pol II) data to map the TSS more precisely. Data for human, mouse, and other organisms are available from ENCODE http://genome.ucsc.edu/ENCODE/ for certain cell lines/tissues. We http://www.nature.com/emboj/journal/v31/n14/full/emboj2012155a.html (mostly M. Jaritz) have painstakingly refined the regulatory elements with these types of data for our cell system, mainly because we wanted to describe and assign novel regulatory elements to give a complete picture. Others http://mpromdb.wistar.upenn.edu/index.html do this on a regular basis with public data.
Plot the profiles in windows of different sizes and check where the action is.
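To make that concrete, here is a minimal sketch in plain Python. It assumes UCSC-style 0-based, half-open transcript coordinates and reads reduced to a single position each (for example the 5' end or the fragment midpoint). The `Gene` tuple, the `(chrom, position)` read format, and the function names are hypothetical and only for illustration; the nested loop is fine for a toy example, but real data would need an interval index or a tool built for this.

```python
from collections import namedtuple

# Hypothetical record mirroring the txStart/txEnd/strand columns of a gene table.
Gene = namedtuple("Gene", "name chrom tx_start tx_end strand")

def tss(gene):
    """0-based TSS: txStart on '+', last transcribed base (txEnd - 1) on '-'."""
    return gene.tx_start if gene.strand == "+" else gene.tx_end - 1

def promoter_window(gene, upstream=1000, downstream=1000):
    """Half-open [start, end) window around the TSS, oriented by strand."""
    pos = tss(gene)
    if gene.strand == "+":
        start, end = pos - upstream, pos + downstream
    else:
        # On the minus strand, upstream lies at higher coordinates.
        start, end = pos - downstream + 1, pos + upstream + 1
    return max(0, start), end

def tss_profile(genes, reads, flank=2000, bin_size=500):
    """Count reads per bin relative to the TSS (upstream bins come first)."""
    n_bins = 2 * flank // bin_size
    profile = [0] * n_bins
    for g in genes:
        pos = tss(g)
        sign = 1 if g.strand == "+" else -1
        for chrom, read_pos in reads:
            if chrom != g.chrom:
                continue
            rel = (read_pos - pos) * sign  # negative = upstream of the TSS
            if -flank <= rel < flank:
                profile[(rel + flank) // bin_size] += 1
    return profile

genes = [Gene("a", "chr1", 10000, 20000, "+"),
         Gene("b", "chr1", 30000, 40000, "-")]
reads = [("chr1", 9900), ("chr1", 10100), ("chr1", 40100), ("chr1", 5000)]

windows = [promoter_window(g) for g in genes]
profile = tss_profile(genes, reads)
```

Running `tss_profile` with several `flank` and `bin_size` values, and comparing the resulting profiles, shows whether the signal sits tightly on the TSS or spreads further up- or downstream, which is a more defensible basis for a window size than a fixed 1 kb or 2 kb.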