Question: Is There A Better Promoter Annotation Than Just Taking The 5-Prime Ends Of UCSC Known Genes?
Ryan Thompson (3.4k), TSRI, La Jolla, CA, wrote 7.0 years ago:

I am looking at ChIP-seq data by comparing read counts in promoter regions of genes. However, I wonder if there is a better way to define a promoter region than just assuming that the 5-prime end of each UCSC known gene annotation is the TSS and considering a radius of 1 or 2 kb around that coordinate.
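For concreteness, the simple approach described above can be sketched as follows. This is only an illustration: the function name `promoter_window` and its arguments are hypothetical, and it assumes 0-based, half-open transcript coordinates (the convention used in UCSC tables), with the 5' end taken from the strand.

```python
def promoter_window(tx_start, tx_end, strand, radius=1000):
    """Return (start, end) of a window of `radius` bp around the 5' end.

    Assumes 0-based, half-open coordinates: the transcript covers
    [tx_start, tx_end). The returned window is half-open as well.
    """
    if strand == "+":
        tss = tx_start          # 5' end is the leftmost base
    elif strand == "-":
        tss = tx_end - 1        # 5' end is the rightmost base
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(0, tss - radius)  # clip at the chromosome start
    end = tss + radius + 1
    return start, end


# Example: a 2 kb total window (radius 1 kb) for a minus-strand transcript
print(promoter_window(5000, 9000, "-", radius=1000))
```

The key detail is the strand: for minus-strand genes the 5' end is the larger coordinate, which is easy to get wrong when working directly from `txStart`/`txEnd` style columns.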

annotation chip-seq • 3.9k views
Ido Tamir (5.0k), Austria, wrote 7.0 years ago:

Why don't you like the easy solution?

  1. Gene models are incomplete, i.e. genes are missing (no readout).
  2. The 5' end is not mapped correctly (incorrect readout).
  3. You are looking at a specific tissue where you know that certain annotated genes are not expressed or certain promoters are not used.
  4. You don't like the generic 1 kb/2 kb window.

Points 1-3:

For a global picture, the available annotation for protein-coding genes is mostly sufficient (points 1-2). For point 3, you will still want to know about the chromatin patterns at genes that are not expressed in your tissue. If you want to make your life complicated, you could try looking at CAGE data and mRNA data (pol II) to map TSSs more precisely. Such data for human, mouse, and other organisms are available from ENCODE (http://genome.ucsc.edu/ENCODE/) for certain cell lines/tissues. We (mostly M. Jaritz) have painstakingly refined the definition of regulatory elements with these types of data for our cell system (http://www.nature.com/emboj/journal/v31/n14/full/emboj2012155a.html), mainly because we wanted to describe/assign novel regulatory elements to give a complete picture. Others (http://mpromdb.wistar.upenn.edu/index.html) do this on a regular basis with public data.

Point 4:

Plot the profiles in different windows, check where the action is.
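One way to "check where the action is" is to aggregate read positions relative to the TSS into bins and look at the resulting profile before committing to a window size. The following is a rough sketch, not a reference implementation: the function `tss_profile` and its inputs are hypothetical, and it assumes all TSSs and reads are on the same chromosome and that each read is represented by a single position.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def tss_profile(tss_sites, read_positions, flank=2000, bin_size=100):
    """Count reads in bins of `bin_size` bp relative to each TSS.

    tss_sites: iterable of (position, strand) on one chromosome.
    read_positions: iterable of read positions on the same chromosome.
    Returns {bin_start_offset: count}; negative offsets are upstream.
    """
    reads = sorted(read_positions)
    profile = Counter()
    for tss, strand in tss_sites:
        # Reads within +/- flank of this TSS (inclusive on both sides)
        lo = bisect_left(reads, tss - flank)
        hi = bisect_right(reads, tss + flank)
        for pos in reads[lo:hi]:
            # Flip the sign on the minus strand so upstream is always negative
            offset = pos - tss if strand == "+" else tss - pos
            profile[(offset // bin_size) * bin_size] += 1
    return dict(sorted(profile.items()))


profile = tss_profile(
    [(1000, "+"), (5000, "-")],
    [900, 1000, 1050, 5100, 5150, 9000],
    flank=200,
)
print(profile)
```

Plotting the returned counts against the bin offsets (for example with flank values of 1 kb, 2 kb, and 5 kb) shows whether the signal is concentrated near the TSS or spread further out.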


I guess my main worry is that for protein-coding genes, the annotation for many genes might be based only on bioinformatic detection of the CDS, so the 5' end might be the start codon rather than the TSS. I guess this corresponds to your second point: incorrectly mapped 5' end.

Reply written 7.0 years ago by Ryan Thompson (3.4k)

One could go through the ENCODE data and check how often CAGE signal overlaps with the currently annotated TSSs. I don't have data at hand, but I think in most cases they agree within +/- 20 nt. I might misunderstand you, but the normal gene annotation pipelines (UCSC, Ensembl) are based on experimental evidence (i.e. RefSeq or other cDNA libraries), and they have separate tracks for purely bioinformatic gene predictions (http://genomewiki.ucsc.edu/index.php/KnowngenesIII). So unless the mRNA was incompletely converted into cDNA (which can happen especially at the 5' end), the annotated 5' end will be correct. For standard model organisms, incomplete 5' ends will almost never be the case now, with all the transcriptome and other data available.
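The overlap check suggested here could be sketched roughly as below. The function name `fraction_tss_near_cage` is hypothetical, and the sketch assumes that the annotated TSSs and the CAGE peak positions have already been reduced to single coordinates on the same chromosome and strand.

```python
from bisect import bisect_left


def fraction_tss_near_cage(tss_positions, cage_positions, max_dist=20):
    """Fraction of annotated TSSs with a CAGE position within `max_dist` nt.

    Both inputs are assumed to be integer coordinates on the same
    chromosome and strand.
    """
    cage = sorted(cage_positions)
    if not tss_positions:
        return 0.0
    hits = 0
    for tss in tss_positions:
        # First CAGE position at or after the left edge of the tolerance window
        i = bisect_left(cage, tss - max_dist)
        if i < len(cage) and cage[i] <= tss + max_dist:
            hits += 1
    return hits / len(tss_positions)


# Toy example with made-up coordinates
print(fraction_tss_near_cage([100, 500, 1000], [110, 530, 985]))
```

Running this per chromosome and strand, and varying `max_dist`, gives a quick sense of how well the annotated 5' ends agree with the CAGE evidence.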

Reply written 7.0 years ago by Ido Tamir (5.0k)