HOMER de novo motif enrichment
5.0 years ago
Lucy ▴ 140

Hi all,

I am running HOMER's findMotifsGenome.pl to identify motifs enriched in a subset of ATAC-seq peaks (~2500 peaks) relative to a control set of ATAC-seq peaks (~4000 peaks). I ran the command twice, once scoring motifs with the binomial distribution and once with hypergeometric enrichment scoring.

Changing the scoring approach resulted in very different motifs being identified as enriched. Which method should I be using in my case, and which results can I trust more?

Best wishes,

Lucy


Calculating Motif Enrichment:

Motif enrichment is calculated using either the cumulative hypergeometric or cumulative binomial distributions. These two statistics assume that the classification of input sequences (i.e. target vs. background) is independent of the occurrence of motifs within them. The statistics consider the total number of target sequences, the total number of background sequences, and how many of each type contain the motif that is being checked for enrichment. From these numbers we can calculate the probability of observing the given number (or more) of target sequences with the motif by chance if we assume there is no relationship between the target sequences and the motif.

The hypergeometric and binomial distributions are similar, except that the hypergeometric assumes sampling without replacement, while the binomial assumes sampling with replacement. The motif enrichment problem is more accurately described by the hypergeometric; however, the binomial has advantages. The difference between them is usually minor if there are a large number of sequences and the background sequences >> target sequences. In these cases, the binomial is preferred since it is faster to calculate. As a result, it is the default statistic for findMotifsGenome.pl, where the number of sequences is typically higher. However, if you use your own background that has a limited number of sequences, it might be a good idea to switch to the hypergeometric (use "-h" to force use of the hypergeometric). findMotifs.pl expects a smaller number of sequences for promoter analysis and uses the hypergeometric by default.
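
To get a feel for how far the two statistics can drift apart when the background is not much larger than the target set, here is a minimal Python sketch of both upper-tail probabilities. It is not HOMER's implementation: the function names are made up, the hypergeometric population is assumed to be the pooled target + background sequences, the binomial success rate is assumed to be the background motif frequency, and the motif counts in the example call are hypothetical.

```python
from math import comb


def hypergeom_tail(t_total, t_hits, b_total, b_hits):
    """P(X >= t_hits) when t_total sequences are drawn WITHOUT replacement
    from the pooled target + background set (assumed population)."""
    pool = t_total + b_total          # all sequences
    with_motif = t_hits + b_hits      # all sequences containing the motif
    upper = min(t_total, with_motif)
    numerator = sum(
        comb(with_motif, i) * comb(pool - with_motif, t_total - i)
        for i in range(t_hits, upper + 1)
    )
    # Exact integer arithmetic, converted to float only at the end.
    return numerator / comb(pool, t_total)


def binom_tail(t_total, t_hits, b_total, b_hits):
    """P(X >= t_hits) for t_total independent draws WITH replacement,
    using the background motif frequency b_hits / b_total as the rate."""
    numerator = sum(
        comb(t_total, i) * b_hits**i * (b_total - b_hits) ** (t_total - i)
        for i in range(t_hits, t_total + 1)
    )
    return numerator / b_total**t_total


if __name__ == "__main__":
    # Hypothetical counts: 2500 target peaks (300 with the motif),
    # 4000 background peaks (320 with the motif).
    print("hypergeometric:", hypergeom_tail(2500, 300, 4000, 320))
    print("binomial:      ", binom_tail(2500, 300, 4000, 320))
```

Running this for each motif with the counts from HOMER's output table is one way to see whether a motif is called enriched under both models or only under one of them.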

One important note: Since HOMER uses an Oligo Table for much of the internal calculations of motif enrichment, where it does not explicitly know how many of the original sequences contain the motif, it approximates this number using the total number of observed motif occurrences in background and target sequences. It assumes the occurrences were equally distributed among the target or background sequences with replacement, where some of the sequences are likely to have more than one occurrence. It uses the expected number of sequences to calculate the enrichment statistic (the final output reflects the actual enrichment based on the original sequences).
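
One way to read the approximation described above is as an occupancy problem: if a given number of motif occurrences land uniformly at random, with replacement, on a set of sequences, how many distinct sequences are expected to be hit at least once? The sketch below uses the standard formula for that expectation. The function name is hypothetical, and whether HOMER uses exactly this formula internally is an assumption.

```python
def expected_sequences_with_motif(n_sequences, n_occurrences):
    """Expected number of distinct sequences hit at least once when
    n_occurrences are placed uniformly at random, with replacement,
    on n_sequences (assumed model, not taken from HOMER's source)."""
    if n_sequences <= 0:
        return 0.0
    # A given sequence is missed by every occurrence with probability
    # (1 - 1/n_sequences) ** n_occurrences.
    miss_probability = (1.0 - 1.0 / n_sequences) ** n_occurrences
    return n_sequences * (1.0 - miss_probability)


if __name__ == "__main__":
    # Hypothetical example: 400 motif occurrences spread over 2500 target peaks.
    print(expected_sequences_with_motif(2500, 400))
```

The expected count is always at most the number of occurrences, since some sequences receive more than one.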


Thank you - yes, I have read this. Based on it, I thought the hypergeometric sounded appropriate. With binomial scoring, HOMER identified 27 enriched motifs, compared to 3 using the hypergeometric. The reason I am unsure of the best approach is that elsewhere in the HOMER documentation, it says to use the hypergeometric if the number of background sequences < number of target sequences, which is not the case for me.