At risk of possibly missing something fundamental about the UCSC browser and various annotations and thus looking like the foolish newb I am on Biostars, I will ask if anyone can shed light on a situation I encountered.
I am mapping ChIP-seq reads to Mouse RefSeq transcription start sites. As one source of the TSS annotations, I downloaded a list from the UCSC table browser, clicking the following options.
I then downloaded the list of TSSs. The first few lines are:
#name         chrom  strand  txStart
NM_001008533  chr1   -       134199214
NM_001039510  chr1   -       134199214
NM_001282945  chr1   -       134199214
NM_175642     chr1   -       25067475
NM_207653     chr1   +       58713285
NM_009805     chr1   +       58713285
NM_008922     chr1   -       33453807
If you go to the very first coordinate listed on chromosome 1, it is actually the transcription termination site (TTS) for the minus-strand gene NM_001008533, as indicated by the direction of the arrows for this gene in the browser.
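For anyone wanting to check this systematically rather than gene by gene in the browser, here is a minimal sketch in Python. It assumes the UCSC genePred convention, in which txStart is always the lower genomic coordinate and txEnd the higher one regardless of strand, so that for a minus-strand transcript the TSS sits at txEnd rather than txStart. It also assumes the export includes a txEnd column, which is not in the list above. The function names are my own, and the two example rows are invented for illustration, not real transcripts.

```python
import csv
import io


def strand_aware_tss(strand, tx_start, tx_end):
    """Return the 0-based TSS coordinate for one transcript.

    Assumes UCSC-style coordinates: txStart is 0-based and inclusive,
    txEnd is exclusive, and txStart < txEnd on both strands.
    """
    if strand == "+":
        return tx_start      # transcription begins at the lower coordinate
    if strand == "-":
        return tx_end - 1    # txEnd is exclusive, so the first transcribed base is txEnd - 1
    raise ValueError(f"unexpected strand: {strand!r}")


def read_tss(handle):
    """Yield (name, chrom, strand, tss) from a tab-separated export."""
    reader = csv.DictReader(handle, delimiter="\t")
    # The header line of the export starts with '#', e.g. '#name'.
    reader.fieldnames = [field.lstrip("#") for field in reader.fieldnames]
    for row in reader:
        tss = strand_aware_tss(row["strand"], int(row["txStart"]), int(row["txEnd"]))
        yield row["name"], row["chrom"], row["strand"], tss


# Invented rows, for illustration only (hypothetical names and coordinates).
example = io.StringIO(
    "#name\tchrom\tstrand\ttxStart\ttxEnd\n"
    "TX_PLUS\tchr1\t+\t1000\t5000\n"
    "TX_MINUS\tchr1\t-\t1000\t5000\n"
)
for record in read_tss(example):
    print(record)
```

If that convention holds for the table in question, the two invented transcripts share the same txStart but get different TSS coordinates, which would match what the browser arrows show for the minus-strand genes.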
There are other examples in this list. There were enough of them that, combined with the proximity of some of them to other, genuine TSSs, K-means clustering of the ChIP-seq signal mapped to TSSs identified a whole group of genes whose annotated TSSs were in fact TTSs. This was complicated by nearby genes oriented tail to head on the same strand as the genes whose TTS was annotated as a TSS.
I may be missing something. In any case, the scenario and any clarification may prove useful to a bench scientist like myself faced with some data analysis tasks. Further, it may serve as a word of caution about the nature of genome annotation. There were few enough instances that a composite plot or metagene analysis across many thousands of genes looked as one would expect, but clustering identified what at first glance was an interesting group. This group of genes survived further analysis aimed at filtering out potential artifactual causes. It wasn't until I started looking at a number of the genes one at a time in the browser that I saw what I just described.