annotatePeaks.pl algorithm for defining promoter-TSS
2.4 years ago
yahelsal • 0

Hi, I'm using annotatePeaks.pl to extract peak annotations from a BED file. I assumed that HOMER annotates a peak as 'promoter-TSS' when the peak center lies within [-1000 bp, +100 bp] of the TSS, but that assumption seems to be wrong: I see peaks annotated as 'promoter-TSS' whose center is downstream of the TSS by more than 100 bp. For example:

PeakID  Chr Start   End Strand  Peak Score  Focus Ratio/Region Size Annotation  Detailed Annotation Distance to TSS Nearest PromoterID  Entrez ID   Nearest Unigene Nearest Refseq  Nearest Ensembl Gene Name   Gene Alias  Gene Description    Gene Type
21904   chr7    8009075 8009574 +   0   NA  promoter-TSS (NR_110018)    promoter-TSS (NR_110018)    897 NM_138426   113263  Hs.131673   NM_138426   ENSG00000106415 GLCCI1  FAM117C|GCTR|GIG18|TSSN1    glucocorticoid-induced 1    protein-coding


I'm running the command:

annotatePeaks.pl filename.bed hg19 > filename.annotation.txt


My question is: what is HOMER's definition for annotating a particular peak as 'promoter-TSS'?


thanks a lot, it helped me understand better

2.4 years ago
2nelly ▴ 310

According to the HOMER documentation:

The process of annotating peaks/regions is divided into two primary parts. The first determines the distance to the nearest TSS and assigns the peak to that gene. The second determines the genomic annotation of the region occupied by the center of the peak/region.

Distance to the nearest TSS

By default, annotatePeaks.pl loads a file in the "/path-to-homer/data/genomes/<genome>/<genome>.tss" that contains the positions of RefSeq transcription start sites. It uses these positions to determine the closest TSS, reporting the distance (negative values mean upstream of the TSS, positive values mean downstream), and various annotation information linked to locus including alternative identifiers (unigene, entrez gene, ensembl, gene symbol etc.). This information is also used to link gene-specific information (see below) to a peak/region, such as gene expression.
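The sign convention described above can be sketched in a few lines of Python. This is only an illustration, not HOMER's actual code: the `Tss` type, the function names, the toy coordinates, and the integer-midpoint definition of the peak center are all assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class Tss(NamedTuple):
    """A transcription start site (hypothetical record, not HOMER's .tss format)."""
    name: str
    position: int
    strand: str  # "+" or "-"


def peak_center(start: int, end: int) -> int:
    # Assumption: the center is the integer midpoint of the peak.
    return (start + end) // 2


def signed_distance(center: int, tss: Tss) -> int:
    # Negative = upstream of the TSS, positive = downstream,
    # measured in the direction of transcription.
    offset = center - tss.position
    return offset if tss.strand == "+" else -offset


def nearest_tss(center: int, tss_list: List[Tss]) -> Tuple[Tss, int]:
    # Pick the TSS with the smallest absolute distance, then report the signed value.
    best = min(tss_list, key=lambda t: abs(center - t.position))
    return best, signed_distance(center, best)


# Toy coordinates, invented purely for illustration.
TOY_TSS = [Tss("geneA", 1000, "+"), Tss("geneB", 5000, "-")]
```

With these toy values, a peak centered at 1200 is 200 bp downstream of geneA, while a peak centered at 5300 lies at a higher genomic coordinate than geneB's TSS but is reported as -300 (upstream), because geneB is on the minus strand.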

Genomic Annotation

To annotate the location of a given peak in terms of important genomic features, annotatePeaks.pl calls a separate program (assignGenomeAnnotation) to efficiently assign peaks to one of millions of possible annotations genome wide. Two types of output are provided. The first is "Basic Annotation" that includes whether a peak is in the TSS (transcription start site), TTS (transcription termination site), Exon (Coding), 5' UTR Exon, 3' UTR Exon, Intronic, or Intergenic, which are common annotations that many researchers are interested in. A second round of "Detailed Annotation" also includes more detailed annotation, also considering repeat elements and CpG islands. Since some annotations overlap, a priority is assigned based on the following order (in case of ties it's random [i.e. if there are two overlapping repeat element annotations]):

1. TSS (by default defined from -1kb to +100bp)
2. TTS (by default defined from -100 bp to +1kb)
3. CDS Exons
4. 5' UTR Exons
5. 3' UTR Exons
6. **CpG Islands
7. **Repeats
8. Introns
9. Intergenic

** Only applicable for the "Detailed Annotation".
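To make the priority rule concrete, here is a minimal Python sketch. It assumes features are given as simple (kind, start, end) tuples and that the promoter window is strand-aware; the function names, the feature representation, and the toy coordinates are hypothetical and do not reflect how assignGenomeAnnotation is implemented.

```python
from typing import List, Tuple

# Priority order taken from the documentation quoted above.
PRIORITY = [
    "TSS", "TTS", "CDS Exon", "5' UTR Exon", "3' UTR Exon",
    "CpG Island", "Repeat", "Intron",
]

Feature = Tuple[str, int, int]  # (kind, start, end), inclusive coordinates


def promoter_window(tss: int, strand: str,
                    upstream: int = 1000, downstream: int = 100) -> Tuple[int, int]:
    # "Upstream" follows the direction of transcription, so the window
    # is mirrored for minus-strand transcripts.
    if strand == "+":
        return tss - upstream, tss + downstream
    return tss - downstream, tss + upstream


def annotate(center: int, features: List[Feature]) -> str:
    # Collect every feature kind covering the peak center,
    # then return the one that ranks highest in PRIORITY.
    hits = {kind for kind, start, end in features if start <= center <= end}
    for kind in PRIORITY:
        if kind in hits:
            return kind
    return "Intergenic"


# Toy example: a minus-strand TSS at 5000 inside a long intron of another gene.
tss_start, tss_end = promoter_window(5000, "-")
TOY_FEATURES = [("Intron", 4000, 9000), ("TSS", tss_start, tss_end)]
```

In this toy setup, a peak centered at 5500 is labelled TSS even though it sits at a higher coordinate than the TSS, because the promoter window of a minus-strand transcript extends toward higher coordinates. A peak centered at 7000 falls back to Intron, and one at 20000 to Intergenic.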

Although HOMER doesn't allow you to explicitly change the definition of the region that is the TSS (-1kb to +100bp), you can "do it yourself" by sorting the annotation output in EXCEL by the "Distance to nearest TSS" column, and selecting those within the range you are interested in.
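If you would rather script this than sort in a spreadsheet, the same filter can be sketched in Python. The sketch assumes the output is tab-delimited with a header row containing a "Distance to TSS" column, as in the example row from the question; the function name, default bounds, and demo rows are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable, List


def filter_by_tss_distance(handle: Iterable[str], lo: int = -1000, hi: int = 100,
                           column: str = "Distance to TSS") -> List[Dict[str, str]]:
    """Keep rows whose signed distance to the nearest TSS lies in [lo, hi]."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        try:
            distance = int(row[column])
        except (KeyError, TypeError, ValueError):
            # Skip rows with a missing column or a non-numeric value such as "NA".
            continue
        if lo <= distance <= hi:
            kept.append(row)
    return kept


# Tiny invented table standing in for an annotation file.
DEMO = (
    "PeakID\tChr\tDistance to TSS\n"
    "p1\tchr1\t-250\n"
    "p2\tchr1\t897\n"
    "p3\tchr1\tNA\n"
)
kept_rows = filter_by_tss_distance(io.StringIO(DEMO))
print([row["PeakID"] for row in kept_rows])
```

For a real file, pass an open file handle instead of the `io.StringIO` demo, for example `open("filename.annotation.txt", newline="")`. Note that this filters on the distance to the nearest TSS only, which is a separate step from the "Annotation" column.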

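Your peak (the black line labelled YourSeq in the genome browser view) is inside this range relative to GLCCI1-DT (divergent transcript, NR_110018), which is why it is annotated as promoter-TSS (NR_110018). HOMER is not using the GLCCI1 annotation for that call. The two columns come from the two separate steps described above: the "Distance to TSS" of 897 is measured to the nearest TSS (NM_138426, GLCCI1), while the "Annotation" column reports the feature that overlaps the peak center.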
