Question: HOMER annotatePeaks - No Annotations
0
gravatar for josh.cutts1
13 months ago by
josh.cutts10 wrote:

I'm trying to recreate an analysis of ChIP-seq data from SRP051788.

I used bowtie2 for mapping and then I used HOMER to generate peak files and am using these to annotate using the hg19 genome using the following command:

annotatePeaks.pl peaks.txt hg19 > annotated_peaks.txt

It runs fine but when I look at the file there are no annotations and no gene IDs and I can't figure out why.

Here is the first couple lines of a peak file I used for annotation:

# HOMER Peaks
# Peak finding parameters:
# tag directory = Bcatenin_WNT3a/
#
# total peaks = 14292
# peak size = 200
# peaks found using tags on both strands
# minimum distance between peaks = 400
# fragment length = 156
# genome size = 2000000000
# Total tags = 15622676.0
# Total tags in peaks = 1025145.0
# Approximate IP efficiency = 6.56%
# tags per bp = 0.007238
# expected tags per peak = 1.448
# maximum tags considered per bp = 1.0
# effective number of tags used for normalization = 10000000.0
# Peaks have been centered at maximum tag pile-up
# number of putative peaks = 14620
#
# input tag directory = H1_Input/
# Fold over input required = 4.00
# Poisson p-value over input required = 1.00e-04
# Putative peaks filtered by input = 214
#
# size of region used for local filtering = 10000
# Fold over local region required = 4.00
# Poisson p-value over local region required = 1.00e-04
# Putative peaks filtered by local signal = 114
#
# Maximum fold under expected unique positions for tags = 2.00
# Putative peaks filtered for being too clonal = 0
#
# cmd = findPeaks Bcatenin_WNT3a/ -style factor -size 200 -tagThreshold 30 -o auto -i H1_Input/
#
# Column Headers:
#PeakID chr start   end strand  Normalized Tag Count    focus ratio findPeaks Score Total Tags (normalized to Control Experiment)   Control Tags    Fold Change vs Control  p-value vs Control  Fold Change vs Local    p-value vs Local    Clonal Fold Change
GL000220.1-1    GL000220.1  134369  134569  +   2451.6  0.854   374.000000  1985.5  44.0    45.12   0.00e+00    19.91   0.00e+00    0.54
GL000220.1-2    GL000220.1  124783  124983  +   905.7   0.760   370.000000  733.5   44.0    16.67   0.00e+00    9.49    0.00e+00    0.54
2-1 2   171571590   171571790   +   338.0   0.860   275.000000  273.7   0.5 547.43  0.00e+00    26.05   0.00e+00    0.67
1-2 1   33202423    33202623    +   275.2   0.806   246.000000  222.9   0.5 445.82  0.00e+00    83.94   0.00e+00    0.71
4-1 4   140545690   140545890   +   274.0   0.864   246.000000  221.9   0.5 443.75  0.00e+00    50.41   0.00e+00    0.70

I've tried adding 'chr' to the start of the chrom start column since it looked like that was the way the annotation file was set up but that didn't fix the problem. I've also tried to delete the header since I read that might fix the problem. a problem with peak annotation after ChIP-seq

I'll try some other annotation programs and to get peak files from MACS2 but since I am new to doing ChIP-seq analysis I was hoping to recreate some other researchers' pipelines before I get my own data back for analysis, so I'd like to try to get HOMER to work also.

chip-seq • 1.2k views
ADD COMMENTlink written 13 months ago by josh.cutts10

Your peaks don't appear to have many tags. Could you possibly retry findPeaks without specifying -size 200 and -tagThreshold 30 (but more importantly without -tagThreshold 30) ?

Also post some alignment stats when you aligned with, presumably, Bowtie, please.

ADD REPLYlink modified 13 months ago • written 13 months ago by Kevin Blighe39k
    > SRR1745494-2.sam
18783943 reads; of these:
  18783943 (100.00%) were unpaired; of these:
    1269356 (6.76%) aligned 0 times
    13428963 (71.49%) aligned exactly 1 time
    4085624 (21.75%) aligned >1 times
93.24% overall alignment rate

reran findPeaks without specifying -size or -tagThreshold 30

findPeaks Bcatenin_WNT3a/ -style factor -o auto -i H1_Input/

# HOMER Peaks
# Peak finding parameters:
# tag directory = Bcatenin_WNT3a/
#
# total peaks = 34902
# peak size = 164
# peaks found using tags on both strands
# minimum distance between peaks = 328
# fragment length = 156
# genome size = 2000000000
# Total tags = 15622676.0
# Total tags in peaks = 1311680.0
# Approximate IP efficiency = 8.40%
# tags per bp = 0.007238
# expected tags per peak = 1.187
# maximum tags considered per bp = 1.0
# effective number of tags used for normalization = 10000000.0
# Peaks have been centered at maximum tag pile-up
# FDR rate threshold = 0.001000000
# FDR effective poisson threshold = 5.227364e-07
# FDR tag threshold = 10.0
# number of putative peaks = 58996
#
# input tag directory = H1_Input/
# Fold over input required = 4.00
# Poisson p-value over input required = 1.00e-04
# Putative peaks filtered by input = 22711
#
# size of region used for local filtering = 10000
# Fold over local region required = 4.00
# Poisson p-value over local region required = 1.00e-04
# Putative peaks filtered by local signal = 1382
#
# Maximum fold under expected unique positions for tags = 2.00
# Putative peaks filtered for being too clonal = 1
#
# cmd = findPeaks Bcatenin_WNT3a/ -style factor -o auto -i H1_Input/
#
# Column Headers:
#PeakID chr start   end strand  Normalized Tag Count    focus ratio findPeaks Score Total Tags (normalized to Control Experiment)   Control Tags    Fold Change vs Control  p-value vs Control  Fold Change vs Local    p-value vs Local    Clonal Fold Change
GL000220.1-2    GL000220.1  134387  134551  +   2247.4  0.857   308.000000  1820.1  36.0    50.56   0.00e+00    21.62   0.00e+00    0.53
GL000220.1-1    GL000220.1  124801  124965  +   815.5   0.761   311.000000  660.4   38.0    17.38   0.00e+00    10.26   0.00e+00    0.53
2-1 2   171571608   171571772   +   307.2   0.862   243.000000  248.8   0.5 497.66  0.00e+00    27.65   0.00e+00    0.63

Then I reran annotatePeaks and still got no annotation/genes:

annotatePeaks.pl peaks.txt hg19 > bcatenin_WNT3a_peaks2_annotated.txt

PeakID (cmd=annotatePeaks.pl peaks.txt hg19)    Chr Start   End Strand  Peak Score  Focus Ratio/Region Size Annotation  Detailed Annotation Distance to TSS Nearest PromoterID  Entrez ID   Nearest Unigene Nearest Refseq  Nearest Ensembl Gene Name   Gene Alias  Gene Description    Gene Type
GL000220.1-2    GL000220.1  134387  134551  +   2247.4  0.857   NA  NA  NA  NA                              
GL000220.1-1    GL000220.1  124801  124965  +   815.5   0.761   NA  NA  NA  NA                              
2-1 2   171571608   171571772   +   307.2   0.862   NA  NA  NA  NA                              
14-1    14  42805239    42805403    +   259.9   0.961   NA  NA  NA  NA                              
15-1    15  93905470    93905634    +   256.0   0.903   NA  NA  NA  NA                              
1-3 1   33202441    33202605    +   251.6   0.812   NA  NA  NA  NA
ADD REPLYlink modified 13 months ago • written 13 months ago by josh.cutts10

It's this study, right? - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64758

Just to check: are you sure that you're not processing one of the GRO-seq samples with the ChIP pipeline? I can see that you were following the Supplementary Methods, too.

GL000220.1 is just a relatively short human contig, by the way, relating to ribosomal RNA I believe. I don't know why it comes up here.

You may have to paste your full pipeline in order for someone to pick up what may have gone wrong here.

ADD REPLYlink written 13 months ago by Kevin Blighe39k

Yes that's correct, that is the study I'm looking at!

No I'm not using the GRO-seq samples. I've only downloaded the following TF ChIP-seq fastq files for analysis, just trying to recreate what was already done (and input):

SRR2037028 SRR2037029 SRR1745491 SRR1745492 SRR1745492 SRR1745493 SRR174594

Sorry, I'm since I'm pretty new to this type of analysis could you please clarify by what you mean by my full pipeline? This would be my guess:

I've just used the supplemental methods as my guide basically. Bowtie2 for mapping, HOMER findPeaks to call peaks with default -style factor parameters except with 200 bp peaks and peaks with 30 or more reads. Then I tried to use annotatePeaks but haven't been able to get any annotations or genes in my annotated peak file.

ADD REPLYlink written 13 months ago by josh.cutts10

Oh, my apologies, by pipeline, I just meant to paste the actual commands that you have used from the very start up to the annotatePeaks command. Nothing new, by the way, the following the HOMER procedure can be tricky because it's such a comprehensive tool.

Edit: that includes how you install cache for the reference genome of interest, if you have that (hg19)

ADD REPLYlink modified 13 months ago • written 13 months ago by Kevin Blighe39k
bowtie2 -p 4 -x ~/pathtoindex/hg19_index -U SRR1745494.fastq > SRR1745494.sam

makeTagDirectory bcatenin_WNT3a/ SRR1745494.sam 

findPeaks Bcatenin_WNT3a/ -style factor -o auto -i H1_Input/

annotatePeaks.pl peaks.txt hg19 > bcatenin_WNT3a_peaks_annotated.txt

I made the index for hg19 awhile ago so I don't really remember how I did that. Do you think that could be the problem?

I forgot to mention that I put the bam file into IGV and the peaks look the same as the bigwig files that the authors posted.

ADD REPLYlink modified 13 months ago • written 13 months ago by josh.cutts10

Oh, hang on, I am not sure if this will resolve the problem but can you add the following steps after your alignment step:

#Convert SAM to BAM
samtools view -bS SRR1745494.sam > SRR1745494.bam

#Sort the BAM files
samtools sort  SRR1745494.bam -o SRR1745494.Sorted.bam

#Mark duplicates
java -jar picard-tools-1.119/MarkDuplicates.jar INPUT=SRR1745494.Sorted.bam OUTPUT=SRR1745494.Sorted.PCRduped.bam ASSUME_SORTED=true METRICS_FILE=SRR1745494.Sorted.PCRduped.txt VALIDATION_STRINGENCY=SILENT

#Expunge the duplicate reads
samtools view -b -F 0x400 SRR1745494.Sorted.PCRduped.bam > SRR1745494.Sorted.PCRdupesRemoved.bam

#Index the new BAM files with duplicates removed
samtools index SRR1745494.Sorted.PCRdupesRemoved.bam

The PCR duplicate steps are most likely not necessary just for testing; however, sorting your SAM file, indexing it, and saving as BAM cannot do any harm and may resolve the issue.

You should probably do some FASTQ trimming prior to alignment, too, but it's not entirely necessary. It's possible that the FASTQ files given by the authors are already QC'd.

ADD REPLYlink written 13 months ago by Kevin Blighe39k

Thanks for the quick reply. Just to clarify: after sorting and removing PCR duplicates I should use the sorted bam files to make the tag directories to call peaks against?

ADD REPLYlink written 13 months ago by josh.cutts10

Yep, that's right. These kind of intermediary steps are not typically mentioned in manuscript methods. I'm not sure if HOMER requires the sorted BAM but it could just be what is causing the problem. Your alignment metrics otherwise look great.

Also just to confirm: you should process all samples (including the input controls) in this way, unfortunately.

For now, the duplicates step really isn't necessary just to figure out if this is the issue, so, you could easily just do:

#Convert SAM to BAM
samtools view -bS SRR1745494.sam > SRR1745494.bam

#Sort the BAM files
samtools sort  SRR1745494.bam -o SRR1745494.Sorted.bam

#Index the new BAM files with duplicates removed
samtools index SRR1745494.Sorted.bam
ADD REPLYlink modified 13 months ago • written 13 months ago by Kevin Blighe39k

I've just tried creating tag directories with the sorted and indexed bam files and using these to call peaks using findPeaks, then used the peaks file to annotate using annotatePeaks but have the same problem, none of the peaks are being annotated.

ADD REPLYlink written 13 months ago by josh.cutts10

Okay, the only other glaring omission is the indication in the methods by the authors to only use 'uniquely mapped' reads. Bowtie2 does not have an implicit parameter that allows for the inclusion of just "uniquely mapped" reads; however, you can virtually ensure unique mapping by setting the MAPQ threshold high, as also recommended by Devon here: How to extract unique mapped results from Bowtie2 bam results? (see also here: Mapping quality: higher = more unique)

If there are still issues, you may consider contacting the author, whose email is on the GEO page (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64758)

ADD REPLYlink written 13 months ago by Kevin Blighe39k

I tried to set the MAPQ thereshold high for the mapping as you recommended and go through the same pipeline but again there were no annotations.

I went back again and tried to change the chromosome column from the Homer peak file (column 2) from just "numbers" to chr"numbers" as it is in the annotation files and it worked. I thought I had tried this before and it didn't work but I must have done something differently this time. Thanks for all your help.

In case anyone else runs into this problem I wanted to include the commands I used to change my Homer peak file (adapted from https://unix.stackexchange.com/questions/148114/how-to-add-words-to-an-existing-column):

cat peaks.txt | awk -F'\t' -vOFS='\t' '{ $2 = "chr" $2 }1' > peaks_chr.txt
ADD REPLYlink written 13 months ago by josh.cutts10

lol - okay, that would have been the first thing that I'd have asked (adding 'chr'), but you mentioned how you had tried it. Either way, glad that you got it sorted out.

ADD REPLYlink written 13 months ago by Kevin Blighe39k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1134 users visited in the last hour