Question

Why Are Transcription Termination Sites (TTS) Included In The Results Of A Transcription Start Site (TSS) Query From UCSC?

11.7 years ago

bede.portz ▴ 540

At the risk of missing something fundamental about the UCSC browser and its annotations, and thus looking like the foolish newb I am on Biostars, I will ask whether anyone can shed light on a situation I encountered.

I am mapping ChIP-seq reads to Mouse RefSeq transcription start sites. As one source of the TSS annotations, I downloaded a list from the UCSC table browser, clicking the following options.

Options selected

I then downloaded the list of TSSs. The first few lines are:

#name    chrom    strand    txStart
NM_001008533    chr1    -    134199214
NM_001039510    chr1    -    134199214
NM_001282945    chr1    -    134199214
NM_175642    chr1    -    25067475
NM_207653    chr1    +    58713285
NM_009805    chr1    +    58713285
NM_008922    chr1    -    33453807

If you go to the very first position listed on chromosome 1, that position is actually the transcription termination site for the minus-strand gene NM_001008533, as indicated by the direction of the arrows for this gene.

Browser screenshot of a TTS annotated as a TSS

There are other examples in this list. There were enough of them, and some lie close enough to genuine TSSs, that when I mapped ChIP-seq signal to the listed TSSs and ran K-means clustering, I identified a whole group of genes whose listed "TSSs" were in fact TTSs. This was complicated by nearby genes oriented tail to head on the same strand as the genes whose TTS was annotated as a TSS.

I may be missing something. In any case, the scenario and any clarification may prove useful to a bench scientist like myself faced with some data analysis tasks. It may also serve as a word of caution about the nature of genome annotation. There were few enough instances that a composite plot or metagene analysis of many thousands of genes looked as one would expect, but clustering identified what at first glance was an interesting group. This group of genes survived further analysis aimed at filtering out potential artifacts. It wasn't until I started looking at a number of the genes one at a time in the browser that I saw what I just described.

chip-seq • 9.2k views


score 4 · Answer 1 · 2013-11-07

11.7 years ago

Istvan Albert 102k

The column naming is utterly misleading: what it fetches is the start column of an interval file. Those coordinates are always specified relative to the positive strand, and it is the strand column that tells you which end is the start site from a transcriptional perspective.

What this means is that for all features on the minus strand the positions reported by txStart will be the termination sites of the actual transcript.

To convince yourself, download both txStart and txEnd and note that txStart < txEnd regardless of the strand.
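As a rough sketch of how one might turn such an export into strand-aware start sites: the snippet below assumes a tab-separated table with `name`, `chrom`, `strand`, `txStart`, and `txEnd` columns, and assumes UCSC-style coordinates (0-based `txStart`, exclusive `txEnd`). The function names and the two example rows are made up for illustration; they are not real RefSeq records.

```python
import csv
import io


def strand_aware_tss(tx_start, tx_end, strand):
    """Return the 0-based position of the first transcribed base.

    Assumes txStart is 0-based, txEnd is exclusive, and txStart < txEnd
    regardless of strand.
    """
    if tx_start >= tx_end:
        raise ValueError("expected txStart < txEnd")
    if strand == "+":
        return tx_start
    if strand == "-":
        # On the minus strand, transcription begins at the high-coordinate end.
        return tx_end - 1
    raise ValueError(f"unknown strand: {strand!r}")


def tss_table(text):
    """Parse a tab-separated export and return (name, chrom, strand, tss) tuples."""
    # Drop the leading '#' that the export puts in front of the header line.
    reader = csv.DictReader(io.StringIO(text.lstrip("#")), delimiter="\t")
    rows = []
    for row in reader:
        tss = strand_aware_tss(int(row["txStart"]), int(row["txEnd"]), row["strand"])
        rows.append((row["name"], row["chrom"], row["strand"], tss))
    return rows


# Hypothetical rows for illustration only.
EXAMPLE = (
    "#name\tchrom\tstrand\ttxStart\ttxEnd\n"
    "tx_plus\tchr1\t+\t1000\t5000\n"
    "tx_minus\tchr1\t-\t2000\t9000\n"
)

if __name__ == "__main__":
    for record in tss_table(EXAMPLE):
        print(*record, sep="\t")
```

If you want 1-based positions to compare against what the browser displays, add 1 to the returned value.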

Thanks for the answers. I find this annotation absurd.

Others new to this type of analysis may learn from how I overlooked this issue at first glance. Because the factor I am looking at is largely found just downstream of the TSS, and because I restricted my analysis to a small region around each position in the RefSeq TSS list, I was really only analyzing + strand genes. K-means then turned up a class of genes with the majority of the ChIP-seq signal upstream, rather than downstream, of the TSS. This turned out to be signal from nearby - strand TSSs being mapped to - strand TTSs that were actually present on my TSS list. This is basically transitive disaster, defined.

Reply written 11.7 years ago by bede.portz ▴ 540

score 3 · Answer 2 · 2013-11-07

For genes on the - strand, the reported transcript start (txStart) is actually the termination site and the reported transcript end (txEnd) is actually the start site. You'll also sometimes find that exon numbering is in the wrong order for genes on the - strand. This is both incorrect and annoying, of course. Thankfully, I've only seen this from UCSC, so I've learned to avoid it (their annotations are kind of a mess anyway; stick with Ensembl).
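To illustrate the exon-numbering point, here is a minimal sketch. It assumes genePred-style `exonStarts` and `exonEnds` fields: comma-separated lists, possibly with a trailing comma, in ascending genomic order. Under that assumption, the first listed exon of a - strand gene is the last exon of the transcript. The function name and the coordinates in the usage line are hypothetical.

```python
def exons_in_transcript_order(exon_starts, exon_ends, strand):
    """Return (exon_number, start, end) tuples numbered from the 5' end of the transcript.

    Assumes exon_starts and exon_ends are comma-separated strings in ascending
    genomic order, optionally ending with a trailing comma.
    """
    starts = [int(x) for x in exon_starts.strip(",").split(",")]
    ends = [int(x) for x in exon_ends.strip(",").split(",")]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")

    exons = sorted(zip(starts, ends))
    if strand == "-":
        # Minus-strand transcripts run from high to low coordinates,
        # so exon 1 is the exon with the highest coordinates.
        exons.reverse()
    elif strand != "+":
        raise ValueError(f"unknown strand: {strand!r}")

    return [(number, start, end) for number, (start, end) in enumerate(exons, start=1)]


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    for exon in exons_in_transcript_order("100,300,500,", "200,400,600,", "-"):
        print(exon)
```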