Is There A Better Promoter Annotation Than Just Taking The 5-Prime Ends Of UCSC Known Genes?
10.2 years ago
Ryan Thompson ★ 3.6k

I am looking at ChIP-seq data by comparing read counts in promoter regions of genes. However, I wonder if there is a better way to define a promoter region than just assuming that the 5-prime end of each UCSC known gene annotation is the TSS and considering a radius of 1 or 2 kb around that coordinate.
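For concreteness, the baseline approach described above can be sketched roughly as follows. This is only a minimal illustration: the `Gene` record, its field names, and the example coordinates are hypothetical, and it assumes UCSC-style 0-based, half-open transcript coordinates, so the 5' end of a minus-strand gene is taken from the transcript end.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    # Hypothetical record; fields mirror a typical gene table but are assumptions.
    name: str
    chrom: str
    tx_start: int  # 0-based, inclusive
    tx_end: int    # 0-based, exclusive
    strand: str    # "+" or "-"


def tss(gene: Gene) -> int:
    """Return the 0-based position of the annotated 5' end (assumed TSS)."""
    return gene.tx_start if gene.strand == "+" else gene.tx_end - 1


def promoter_window(gene: Gene, radius: int = 1000) -> tuple[str, int, int]:
    """Return (chrom, start, end) as a half-open window of +/- radius around the TSS."""
    pos = tss(gene)
    start = max(0, pos - radius)  # clip at the chromosome start
    end = pos + radius + 1
    return gene.chrom, start, end


# Made-up example genes, one per strand.
genes = [
    Gene("geneA", "chr1", 5000, 9000, "+"),
    Gene("geneB", "chr1", 20000, 26000, "-"),
]
windows = [promoter_window(g, radius=1000) for g in genes]
```

Read counts per promoter would then be computed over these windows; the choice of `radius` is exactly the 1 kb versus 2 kb question.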

chip-seq annotation • 5.4k views
10.2 years ago
Ido Tamir 5.2k

Why don't you like the easy solution?

  1. gene models are incomplete i.e. genes are missing (no readout)
  2. 5' end is not completely correctly mapped (incorrect readout)
  3. you look at a special tissue where you know certain annotated genes are not expressed or certain promoters not used
  4. you don't like the generic 1kb/2kb window

Points 1-3:

For a global picture, most of the available annotation for protein-coding genes is sufficient (points 1-2). For point 3, you will also want to know about the chromatin patterns at genes that are not expressed in your tissue. If you want to make your life complicated, you could try looking at CAGE data and mRNA data (Pol II) to map TSSs more precisely. Such data are available from ENCODE for human, mouse, and other organisms, for certain cell lines and tissues. We (mostly M. Jaritz) have painstakingly refined the regulatory elements for our cell system using these types of data, largely because we wanted to describe and assign novel regulatory elements to give a complete picture. Others do this on a regular basis with public data.

Point 4:

Plot the profiles in different windows, check where the action is.
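One way to "check where the action is" is to aggregate read positions into bins relative to each TSS and look at the resulting profile. The sketch below is an assumption-laden illustration, not a recommended pipeline: the function name `tss_profile`, the input layout, and the toy coordinates are all hypothetical, and the nested loop is deliberately naive (a real data set would need an indexed lookup).

```python
from collections import Counter


def tss_profile(tss_list, read_positions, flank=2000, bin_size=100):
    """Aggregate read counts in bins relative to the TSS.

    tss_list: list of (chrom, pos, strand) tuples.
    read_positions: dict mapping chrom -> list of read positions.
    Returns a list of (bin_start_offset, count), with offsets oriented so
    that negative values are upstream of the TSS on either strand.
    """
    counts = Counter()
    for chrom, pos, strand in tss_list:
        for read in read_positions.get(chrom, []):
            offset = read - pos
            if strand == "-":
                offset = -offset  # flip so upstream is always negative
            if -flank <= offset < flank:
                # Floor to the start of the containing bin.
                counts[offset // bin_size * bin_size] += 1
    return [(b, counts[b]) for b in range(-flank, flank, bin_size)]


# Toy input: two TSSs on opposite strands and a handful of read positions.
example_tss = [("chr1", 1000, "+"), ("chr1", 5000, "-")]
example_reads = {"chr1": [950, 1010, 1020, 5050, 4900]}
profile = tss_profile(example_tss, example_reads, flank=200, bin_size=100)
```

Plotting the second element of each pair against the first, for a few values of `flank`, shows whether the signal is concentrated well inside the generic 1 kb or 2 kb window.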

Entering edit mode

I guess my main worry is that for protein-coding genes, the annotation for many genes might be based only on bioinformatic detection of the CDS, so the 5' end might be the start codon rather than the TSS. I guess this corresponds to your second point: incorrectly mapped 5' end.

Entering edit mode

One could go through the ENCODE data and check how often CAGE peaks overlap with currently annotated TSSs. I don't have data at hand, but I think in most cases they overlap within +/- 20 nt. I might misunderstand you, but the normal gene annotation pipelines (UCSC, Ensembl) are based on experimental evidence (i.e. RefSeq or other cDNA libraries), and they have separate tracks for purely bioinformatic gene predictions. So unless the mRNA was incompletely converted into cDNA (which can happen especially at the 5' end), the annotated 5' end will be correct. For standard model organisms, incomplete conversion will almost never be an issue now, given all the transcriptome data available.
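The overlap check suggested here could be sketched along the following lines. Everything in it is illustrative: the function name `fraction_supported`, the input layout keyed by (chromosome, strand), and the toy positions are assumptions, and CAGE peaks are reduced to single representative positions for simplicity.

```python
from bisect import bisect_left


def fraction_supported(tss_by_key, cage_by_key, tolerance=20):
    """Fraction of annotated TSSs with a CAGE peak within +/- tolerance nt.

    Both arguments map a (chrom, strand) key to a list of positions.
    """
    supported = 0
    total = 0
    for key, tss_positions in tss_by_key.items():
        peaks = sorted(cage_by_key.get(key, []))
        for pos in tss_positions:
            total += 1
            # First peak at or after the lower bound of the tolerance window.
            i = bisect_left(peaks, pos - tolerance)
            if i < len(peaks) and peaks[i] <= pos + tolerance:
                supported += 1
    return supported / total if total else 0.0


# Made-up positions purely to exercise the function.
annotated = {("chr1", "+"): [1000, 5000], ("chr2", "-"): [300]}
cage = {("chr1", "+"): [1015, 4970], ("chr2", "-"): [320]}
support = fraction_supported(annotated, cage, tolerance=20)
```

Running this with real annotation and CAGE peak calls would show how well the "+/- 20 nt" impression holds for a given organism and cell type.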

