Question: HOMER de novo motif enrichment
18 months ago by
Lucy40

Hi all,

I am running HOMER's findMotifsGenome.pl to identify motifs enriched in a subset of ATAC-seq peaks (~2500 peaks) relative to a control set of ATAC-seq peaks (~4000 peaks). I ran the command twice, scoring motifs once with the binomial distribution and once with the hypergeometric distribution.

Changing the scoring approach resulted in very different motifs being identified as enriched. Which method should I be using in my case, and which results can I trust more?

Best wishes,

Lucy

motif enrichment homer • 1.3k views
modified 18 months ago • written 18 months ago by Lucy40

Did you read the manual?

Calculating Motif Enrichment:

Motif enrichment is calculated using either the cumulative hypergeometric or cumulative binomial distributions. These two statistics assume that the classification of input sequences (i.e. target vs. background) is independent of the occurrence of motifs within them. The statistics consider the total number of target sequences, the total number of background sequences, and how many of each type contain the motif that is being checked for enrichment. From these numbers we can calculate the probability of observing the given number (or more) of target sequences with the motif by chance if we assume there is no relationship between the target sequences and the motif. The hypergeometric and binomial distributions are similar, except that the hypergeometric assumes sampling without replacement, while the binomial assumes sampling with replacement. The motif enrichment problem is more accurately described by the hypergeometric; however, the binomial has advantages. The difference between them is usually minor if there are a large number of sequences and the background sequences >> target sequences. In these cases, the binomial is preferred since it is faster to calculate. As a result, it is the default statistic for findMotifsGenome.pl, where the number of sequences is typically higher. However, if you use your own background that has a limited number of sequences, it might be a good idea to switch to the hypergeometric (use "-h" to force use of the hypergeometric). findMotifs.pl expects a smaller number of sequences for promoter analysis and uses the hypergeometric by default.
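To make the difference between the two statistics concrete, here is a minimal Python sketch that computes both upper-tail probabilities for a single motif. It is not HOMER's code. The peak counts come from the question, but the motif counts (500 target hits, 600 background hits) are made up for illustration, and estimating the binomial success probability from the background fraction is an assumption about how such a test is usually set up.

```python
from math import exp, lgamma, log


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_sum_exp(values):
    """Stable log(sum(exp(v))) for a non-empty list of log-probabilities."""
    peak = max(values)
    return peak + log(sum(exp(v - peak) for v in values))


def hypergeom_upper_tail(k, total, with_motif, n_target):
    """P(X >= k) when n_target sequences are drawn WITHOUT replacement from
    `total` sequences, of which `with_motif` contain the motif."""
    without_motif = total - with_motif
    first = max(k, 0, n_target - without_motif)
    last = min(with_motif, n_target)
    if first > last:
        return 0.0
    denom = log_comb(total, n_target)
    terms = [
        log_comb(with_motif, i) + log_comb(without_motif, n_target - i) - denom
        for i in range(first, last + 1)
    ]
    return min(1.0, exp(log_sum_exp(terms)))


def binom_upper_tail(k, n_target, p):
    """P(X >= k) for n_target independent draws (WITH replacement), each
    containing the motif with probability p, where 0 < p < 1."""
    first = max(k, 0)
    if first > n_target:
        return 0.0
    terms = [
        log_comb(n_target, i) + i * log(p) + (n_target - i) * log(1.0 - p)
        for i in range(first, n_target + 1)
    ]
    return min(1.0, exp(log_sum_exp(terms)))


n_target, n_background = 2500, 4000        # peak counts from the question
target_hits, background_hits = 500, 600    # hypothetical motif counts

p_hyper = hypergeom_upper_tail(
    target_hits, n_target + n_background, target_hits + background_hits, n_target
)
# Assumption: the binomial success probability is the background motif fraction.
p_binom = binom_upper_tail(target_hits, n_target, background_hits / n_background)

print(f"hypergeometric p = {p_hyper:.3g}")
print(f"binomial p       = {p_binom:.3g}")
```

With these made-up counts the binomial p-value comes out smaller than the hypergeometric one. The binomial treats the background motif rate as exactly known, which becomes a rougher approximation when the background is not much larger than the target set.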

One important note: Since HOMER uses an Oligo Table for much of the internal calculations of motif enrichment, where it does not explicitly know how many of the original sequences contain the motif, it approximates this number using the total number of observed motif occurrences in background and target sequences. It assumes the occurrences were equally distributed among the target or background sequences with replacement, where some of the sequences are likely to have more than one occurrence. It uses the expected number of sequences to calculate the enrichment statistic (the final output reflects the actual enrichment based on the original sequences).
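The approximation described in this note can be sketched with the standard occupancy formula: if m occurrences are assigned uniformly at random, with replacement, to N sequences, the expected number of sequences holding at least one occurrence is N * (1 - (1 - 1/N)^m). Whether HOMER uses exactly this expression internally is an assumption, and the function name below is hypothetical.

```python
def expected_sequences_with_motif(occurrences, n_sequences):
    """Expected number of distinct sequences hit when `occurrences` motif
    instances are assigned uniformly at random, with replacement, to
    `n_sequences` sequences (standard occupancy formula)."""
    if n_sequences <= 0:
        raise ValueError("n_sequences must be positive")
    # Probability that one particular sequence receives none of the occurrences.
    miss_probability = (1.0 - 1.0 / n_sequences) ** occurrences
    return n_sequences * (1.0 - miss_probability)


# Hypothetical example: 600 motif occurrences spread over 4000 background peaks.
print(expected_sequences_with_motif(600, 4000))
```

The result is always at most the number of occurrences, because some sequences are expected to carry more than one.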

modified 18 months ago • written 18 months ago by ATpoint35k

Thank you, yes, I have read this. Based on it, I thought the hypergeometric sounded appropriate. With binomial scoring, HOMER identified 27 enriched motifs, compared to 3 using the hypergeometric. The reason I am unsure of the best approach is that elsewhere in the HOMER documentation, it says to use the hypergeometric if the number of background sequences < number of target sequences, which is not the case for me.

written 18 months ago by Lucy40