I have chip-seq peaks and want to find genes to wich they annotate using homer annotatepeaks.pl. Homer only has pre-built UCSC versions of genome and anotation (not ensembl). As I'm working with ensembl files, I had to first load ensembl genome and annotation into homer:
loadGenome.pl -name GRCh38.92 -org human -fasta $GENOME -gtf $GTF -gid
and only then annotate genes:
annotatePeaks.pl $Peak GRCh38.92 -gtf $GTF > $OUT
Result was:
PeakID (cmd=annotatePeaks.pl /workdir/GATA_3TF_HDF_day2/macs2_peaks/FDR_0.05/GATA_3TF_day2_macsFDR_0.05_rep1_peaks.narrowPeak GRCh38.92 -gtf /out/Homo_sapiens.GRCh38.92.gtf) Chr Start End Strand Peak Score Focus Ratio/Region Size Annotation Detailed Annotation Distance to TSS Nearest PromoterID Entrez ID Nearest Unigene Nearest Refseq Nearest Ensembl Gene Name Gene Alias Gene Description Gene Type
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2469 3 93470312 93470846 + 1074 NA Intergenic NA -373185 ENST00000515918
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2286 21 36478108 36478442 + 712 NA intron (ENST00000399137, intron 1 of 2) NA 1815 ENST00000399137 23562 Hs.660278 NM_012130 ENSG00000159261 CLDN14 DFNB29 claudin 14 protein-coding
GATA_3TF_day2_macsFDR_0.05_rep1_peak_3073 5 131164383 131164849 + 682 NA intron (ENST00000506207, intron 1 of 2) NA 604 ENST00000504202
For the 2nd peak we got full info (including gene name, ensembl geneID), for 1st and 3rd peaks we only got info until 'Nearest Promoter ID' column. So I though only peak 2 could be annotated.
But as the number of 'fully anotated' peaks was small I decided to try another way. To use Homer pre-built UCSC genome/gene annotation and add 'chr' to chromossome field in Peaks file to have same chromossome format as UCSC gene annotation.
awk '{ if($1 !~ /^#/){print "chr"$0} else{print $0} }' $Peak > $Peak1
annotatePeaks.pl $Peak1 hg38 > $OUT
the result was many more peaks 'fully annotated':
PeakID (cmd=annotatePeaks.pl /Volumes/LaCie/Tania/ChIPseq_analysis_20180725/Results/GATA_3TF_day2/macs2_peaks/FDR_0.05/GATA_3TF_day2_macsFDR_0.05_rep1_peaks_sort.narrowPeak_chr hg38) Chr Start End Strand Peak Score Focus Ratio/Region SizAnnotation Detailed Annotation Distance to TSS Nearest PromoterID Entrez ID Nearest Unigene Nearest RefseqNearest Ensembl Gene Name Gene Alias Gene Description Gene Type
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2469 chr3 93470312 93470846 + 1074 NA Intergenic SAR|Satellite|Satellite 503502 NM_001314077 5627 Hs.64016 NM_000313 ENSG00000184500 PROS1 PROS|PS21|PS22|PS23|PS24|PS25|PSA|THPH5|THPH6 protein S protein-coding
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2286 chr21 36478108 36478442 + 712 NA intron (NM_144492, intron 1 of 2) intron (NM_144492, intron 1 of 2) 1815 NM_144492 23562 Hs.660278 NM_012130 ENSG00000159261 CLDN14 DFNB29 claudin 14 protein-coding
Which approach is correct the one that gives us more (2nd) or less annotated peaks (1st approach)?
Should I just try to convert ensembl Transcript ID in 'Nearest Promoter ID' column to ensembl Gene ID of all genes? In that case, should I use ensembl Transcript ID from 'Nearest Promoter ID' column or from 'Annotation' column? (in gene 3 they are different)