Question

Deleted:Homer: annotate peaks with ensembl annotation (.gtf) file instead of UCSC anotation file?

2

Entering edit mode

5.4 years ago

salamandra ▴ 550

I have chip-seq peaks and want to find genes to wich they annotate using homer annotatepeaks.pl. Homer only has pre-built UCSC versions of genome and anotation (not ensembl). As I'm working with ensembl files, I had to first load ensembl genome and annotation into homer:

loadGenome.pl -name GRCh38.92 -org human -fasta $GENOME -gtf $GTF -gid

and only then annotate genes:

annotatePeaks.pl $Peak GRCh38.92 -gtf $GTF > $OUT

Result was:

PeakID (cmd=annotatePeaks.pl /workdir/GATA_3TF_HDF_day2/macs2_peaks/FDR_0.05/GATA_3TF_day2_macsFDR_0.05_rep1_peaks.narrowPeak GRCh38.92 -gtf /out/Homo_sapiens.GRCh38.92.gtf)   Chr     Start   End     Strand  Peak Score      Focus Ratio/Region Size Annotation      Detailed Annotation     Distance to TSS Nearest PromoterID      Entrez ID       Nearest Unigene Nearest Refseq  Nearest Ensembl Gene Name       Gene Alias      Gene Description        Gene Type
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2469       3       93470312        93470846        +       1074    NA      Intergenic      NA      -373185 ENST00000515918
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2286       21      36478108        36478442        +       712     NA      intron (ENST00000399137, intron 1 of 2) NA      1815    ENST00000399137 23562   Hs.660278       NM_012130       ENSG00000159261 CLDN14  DFNB29  claudin 14      protein-coding
GATA_3TF_day2_macsFDR_0.05_rep1_peak_3073       5       131164383       131164849       +       682     NA      intron (ENST00000506207, intron 1 of 2) NA      604     ENST00000504202

For the 2nd peak we got full info (including gene name, ensembl geneID), for 1st and 3rd peaks we only got info until 'Nearest Promoter ID' column. So I though only peak 2 could be annotated.

But as the number of 'fully anotated' peaks was small I decided to try another way. To use Homer pre-built UCSC genome/gene annotation and add 'chr' to chromossome field in Peaks file to have same chromossome format as UCSC gene annotation.

awk '{ if($1 !~ /^#/){print "chr"$0} else{print $0} }' $Peak > $Peak1
annotatePeaks.pl $Peak1 hg38 > $OUT

the result was many more peaks 'fully annotated':

PeakID (cmd=annotatePeaks.pl /Volumes/LaCie/Tania/ChIPseq_analysis_20180725/Results/GATA_3TF_day2/macs2_peaks/FDR_0.05/GATA_3TF_day2_macsFDR_0.05_rep1_peaks_sort.narrowPeak_chr hg38)  Chr Start   End Strand  Peak Score  Focus Ratio/Region SizAnnotation    Detailed Annotation Distance to TSS Nearest PromoterID  Entrez ID   Nearest Unigene Nearest RefseqNearest Ensembl   Gene Name   Gene Alias  Gene Description    Gene Type
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2469   chr3    93470312    93470846    +   1074    NA  Intergenic  SAR|Satellite|Satellite 503502  NM_001314077    5627    Hs.64016    NM_000313   ENSG00000184500 PROS1   PROS|PS21|PS22|PS23|PS24|PS25|PSA|THPH5|THPH6   protein S   protein-coding
GATA_3TF_day2_macsFDR_0.05_rep1_peak_2286   chr21   36478108    36478442    +   712 NA  intron (NM_144492, intron 1 of 2)   intron (NM_144492, intron 1 of 2)   1815    NM_144492   23562   Hs.660278   NM_012130   ENSG00000159261 CLDN14  DFNB29  claudin 14  protein-coding

Which approach is correct the one that gives us more (2nd) or less annotated peaks (1st approach)?

Should I just try to convert ensembl Transcript ID in 'Nearest Promoter ID' column to ensembl Gene ID of all genes? In that case, should I use ensembl Transcript ID from 'Nearest Promoter ID' column or from 'Annotation' column? (in gene 3 they are different)

homer annotatePeaks.pl • 3.1k views

ADD COMMENT • link 5.4 years ago by salamandra ▴ 550