Question: Homer: annotate peaks with ensembl annotation (.gtf) file instead of UCSC anotation file?
0
gravatar for salamandra
11 months ago by
salamandra240
salamandra240 wrote:

I have chip-seq peaks and want to find genes to wich they annotate using homer annotatepeaks.pl. Homer only has pre-built UCSC versions of genome and anotation (not ensembl). As I'm working with ensembl files, I had to first load ensembl genome and annotation into homer:

loadGenome.pl -name GRCh38.92 -org human -fasta $GENOME -gtf $GTF -gid

and only then annotate genes:

annotatePeaks.pl $Peak GRCh38.92 -gtf $GTF > $OUT

Result was:

PeakID (cmd=annotatePeaks.pl /workdir/GATA_3TF_HDF_day2/macs2_peaks/FDR_0.05/GATA_3TF_day2_macsFDR_0.05_rep1_peaks.narrowPeak GRCh38.92 -gtf /out/Homo_sapiens.GRCh38.92.gtf)   Chr     Start   End     Strand  Peak Score      Focus Ratio/Region Size Annotation      Detailed Annotation     Distance to TSS Nearest PromoterID      Entrez ID       Nearest Unigene Nearest Refseq  Nearest Ensembl Gene Name       Gene Alias      Gene Description        Gene Type
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2469       3       93470312        93470846        +       1074    NA      Intergenic      NA      -373185 ENST00000515918
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2286       21      36478108        36478442        +       712     NA      intron (ENST00000399137, intron 1 of 2) NA      1815    ENST00000399137 23562   Hs.660278       NM_012130       ENSG00000159261 CLDN14  DFNB29  claudin 14      protein-coding
GATA_3TF_day2_macsFDR_0.05_rep1_peak_3073       5       131164383       131164849       +       682     NA      intron (ENST00000506207, intron 1 of 2) NA      604     ENST00000504202

For the 2nd peak we got full info (including gene name, ensembl geneID), for 1st and 3rd peaks we only got info until 'Nearest Promoter ID' column. So I though only peak 2 could be annotated.

But as the number of 'fully anotated' peaks was small I decided to try another way. To use Homer pre-built UCSC genome/gene annotation and add 'chr' to chromossome field in Peaks file to have same chromossome format as UCSC gene annotation.

awk '{ if($1 !~ /^#/){print "chr"$0} else{print $0} }' $Peak > $Peak1
annotatePeaks.pl $Peak1 hg38 > $OUT

the result was many more peaks 'fully annotated':

PeakID (cmd=annotatePeaks.pl /Volumes/LaCie/Tania/ChIPseq_analysis_20180725/Results/GATA_3TF_day2/macs2_peaks/FDR_0.05/GATA_3TF_day2_macsFDR_0.05_rep1_peaks_sort.narrowPeak_chr hg38)  Chr Start   End Strand  Peak Score  Focus Ratio/Region SizAnnotation    Detailed Annotation Distance to TSS Nearest PromoterID  Entrez ID   Nearest Unigene Nearest RefseqNearest Ensembl   Gene Name   Gene Alias  Gene Description    Gene Type
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2469   chr3    93470312    93470846    +   1074    NA  Intergenic  SAR|Satellite|Satellite 503502  NM_001314077    5627    Hs.64016    NM_000313   ENSG00000184500 PROS1   PROS|PS21|PS22|PS23|PS24|PS25|PSA|THPH5|THPH6   protein S   protein-coding
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2286   chr21   36478108    36478442    +   712 NA  intron (NM_144492, intron 1 of 2)   intron (NM_144492, intron 1 of 2)   1815    NM_144492   23562   Hs.660278   NM_012130   ENSG00000159261 CLDN14  DFNB29  claudin 14  protein-coding

Which approach is correct the one that gives us more (2nd) or less annotated peaks (1st approach)?

Should I just try to convert ensembl Transcript ID in 'Nearest Promoter ID' column to ensembl Gene ID of all genes? In that case, should I use ensembl Transcript ID from 'Nearest Promoter ID' column or from 'Annotation' column? (in gene 3 they are different)

annotatepeaks.pl homer • 763 views
ADD COMMENTlink modified 10 months ago • written 11 months ago by salamandra240
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1170 users visited in the last hour