I am looking at ChIP-seq data by comparing read counts in promoter regions of genes. However, I wonder if there is a better way to define a promoter region than just assuming that the 5-prime end of each UCSC known gene annotation is the TSS and considering a radius of 1 or 2 kb around that coordinate.
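For concreteness, here is a minimal sketch of the simple definition I mean. It assumes UCSC-style coordinates (0-based, half-open, with txStart always the lower coordinate), so for minus-strand genes the 5' end is taken from txEnd rather than txStart. The `Transcript` record and the function names are hypothetical, made up for illustration, and not part of any UCSC tool.

```python
from typing import NamedTuple, Tuple


class Transcript(NamedTuple):
    """Hypothetical record holding a few UCSC-style gene table fields."""
    name: str
    chrom: str
    strand: str    # "+" or "-"
    tx_start: int  # 0-based, inclusive (lower coordinate on either strand)
    tx_end: int    # 0-based, exclusive (higher coordinate on either strand)


def tss_position(tx: Transcript) -> int:
    """Return the 0-based position of the annotated 5' end.

    On the minus strand the transcript starts at the higher coordinate,
    which is tx_end - 1 in half-open coordinates.
    """
    return tx.tx_start if tx.strand == "+" else tx.tx_end - 1


def promoter_window(tx: Transcript, radius: int = 1000) -> Tuple[str, int, int]:
    """Return (chrom, start, end) covering `radius` bases on each side of the TSS.

    The interval is 0-based, half-open, and clipped at the chromosome start.
    It is not clipped at the chromosome end, since that needs chromosome sizes.
    """
    tss = tss_position(tx)
    start = max(0, tss - radius)
    end = tss + radius + 1
    return tx.chrom, start, end


if __name__ == "__main__":
    plus_gene = Transcript("geneA", "chr1", "+", 5000, 9000)
    minus_gene = Transcript("geneB", "chr1", "-", 5000, 9000)
    print(promoter_window(plus_gene, radius=1000))
    print(promoter_window(minus_gene, radius=1000))
```

Read counts would then be summed over each returned interval. Whether a symmetric window is the right choice is exactly the part I am unsure about.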
My main worry is that for many protein-coding genes, the annotation might be based only on bioinformatic detection of the CDS, so the annotated 5' end might be the start codon rather than the TSS. I guess this corresponds to your second point: an incorrectly mapped 5' end.
One could go through the ENCODE data and check how often CAGE signal overlaps the currently annotated TSSs. I don't have the data at hand, but I think that in most cases they overlap within +/- 20 nt. I might be misunderstanding you, but the standard gene annotation pipelines (UCSC, Ensembl) are based on experimental evidence (i.e. RefSeq or other cDNA libraries), and they have separate tracks for purely bioinformatic gene predictions (http://genomewiki.ucsc.edu/index.php/KnowngenesIII). So unless the mRNA was incompletely converted into cDNA (which can happen, especially at the 5' end), the annotated 5' end will be correct. For standard model organisms, incomplete conversion will almost never be an issue now, given all the transcriptome and other data available.
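A rough sketch of that check, assuming the CAGE data and the annotated TSSs have each been reduced to single-nucleotide positions given as (chrom, strand, position) tuples. This input format and the function names are my own assumptions for illustration, not an ENCODE or UCSC file format or API, and the 20 nt tolerance is just the guess from above.

```python
from bisect import bisect_left
from collections import defaultdict


def index_peaks(peaks):
    """Group CAGE peak positions by (chrom, strand) and sort them.

    `peaks` is an iterable of (chrom, strand, position) tuples.
    """
    index = defaultdict(list)
    for chrom, strand, pos in peaks:
        index[(chrom, strand)].append(pos)
    for positions in index.values():
        positions.sort()
    return index


def has_peak_within(index, chrom, strand, tss, tolerance=20):
    """True if a peak on the same strand lies within +/- tolerance of the TSS."""
    positions = index.get((chrom, strand), [])
    # First peak at or after the left edge of the tolerance window.
    i = bisect_left(positions, tss - tolerance)
    return i < len(positions) and positions[i] <= tss + tolerance


def fraction_supported(tss_list, index, tolerance=20):
    """Fraction of annotated TSSs that have a nearby CAGE peak."""
    if not tss_list:
        return 0.0
    hits = sum(has_peak_within(index, c, s, t, tolerance) for c, s, t in tss_list)
    return hits / len(tss_list)


if __name__ == "__main__":
    # Toy values only, not real data.
    peaks = [("chr1", "+", 1010), ("chr1", "-", 5000)]
    annotated = [("chr1", "+", 1000), ("chr1", "+", 2000), ("chr1", "-", 5021)]
    print(fraction_supported(annotated, index_peaks(peaks)))
```

The annotated TSSs without a nearby peak on the same strand would be the candidates for an incorrectly mapped 5' end, although a missing peak could also just mean the gene is not expressed in the assayed sample.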