Question

Identify binding motifs within large super enhancer region

9 months ago

mkunika • 0

Hello,

From my H3K27ac ChIP-seq data, I identified 500 super enhancer regions using HOMER's findPeaks -style super. Using HOMER's findMotifsGenome.pl, I then found 4 binding motifs enriched across these 500 super enhancer regions.

First, I would like to find all significant binding motifs within a single super enhancer region (which can be as large as 85 kb). So far, I have tried MEME and Tomtom from the MEME Suite, RSAT, and TRAP. Which motif discovery algorithm should I use to identify many or all motifs within a long region? Ideally, the algorithm would compare the discovered motifs against a motif database to identify which TFs each motif corresponds to. It would also be a big plus if the discovered motifs could be visualized in a genome viewer like IGV.

MEME: I set MEME to find the top 20 motifs within the region. However, most of the motifs found are repetitive DNA sequences. Is there a way to filter out repeat sequences during motif discovery in MEME?
RSAT: I ran its motif discovery algorithm using 3 different TF binding motif databases: JASPAR CORE non-redundant vertebrates, footprintDB, and HOMER's. The 3 runs discovered the same motifs (with the same sequences), but reported completely different TF matches for each motif. Which database would be best to reference for human cells?
TRAP: It only accepts sequences up to 5 kb, so it is not suitable for my very large regions.

Second, I would like to identify all the super enhancers that contain the top enriched motif. So far, I plan on running all 500 super enhancer regions through FIMO in the MEME Suite. However, this approach doesn't seem optimal. I would really appreciate any recommendations.

Thank you!

MEME Jaspar enhancer Homer motifs super • 600 views

updated 9 months ago by Alex Reynolds 35k • written 9 months ago by mkunika • 0

score 0 · Answer 1 · 2023-07-10

If you're using an established TF model database like JASPAR (which is a tag in your question), you can use FIMO to predict putative binding sites in your regions.

A basic scenario for running FIMO: https://bioinformatics.stackexchange.com/questions/2467/where-to-download-jaspar-tfbs-motif-bed-file/2491#2491

This approach lets you generate a BED file of putative TF binding sites ("FIMO hits"), with coordinates in a given reference genome.
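As a rough sketch of that conversion, assuming FIMO was run on sequences whose names are chromosome names and that it wrote the 10-column fimo.tsv layout (motif_id, motif_alt_id, sequence_name, start, stop, strand, score, p-value, q-value, matched_sequence), something like the following could turn the hits into BED. The function name fimo_tsv_to_bed and the file paths in the comments are only illustrative. Older FIMO versions use a different column layout, so check your header line first.

```shell
# Hypothetical helper: convert FIMO TSV (read from stdin) to 6-column BED.
# Assumes the 10-column fimo.tsv layout; FIMO starts are 1-based, BED starts
# are 0-based, so 1 is subtracted from the start column.
fimo_tsv_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
       $1 == "motif_id" || /^#/ || NF < 10 { next }  # skip header, comments, blanks
       { print $3, $4 - 1, $5, $1, $7, $6 }'         # chrom, start, end, id, score, strand
}

# Example usage (paths are assumptions):
#   fimo_tsv_to_bed < fimo_out/fimo.tsv | sort-bed - > fimoHits.bed
```

Column 5 here carries the raw FIMO score rather than a 0 to 1000 BED score, which is fine for bedmap ID lookups but may need rescaling for some genome browsers.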

This BED file can then be mapped against your regions with BEDOPS tools such as bedmap or bedops to retrieve the TF site calls located within peaks or other genomic regions, e.g.:

bedmap --echo --echo-map-id-uniq --delim '\t' peaks.bed fimoHits.bed > answer.bed
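To then pull out only the regions that contain one particular motif (for example, the top enriched one), one option is to filter answer.bed on its last column. This is a sketch under two assumptions: the ID column (column 4) of fimoHits.bed holds the motif ID, and bedmap joins multiple mapped IDs with its default ';' separator. peaks_with_motif is a made-up helper name, and MA0139.1 is just a placeholder ID.

```shell
# Hypothetical helper: keep bedmap output lines (read from stdin) whose last
# column lists the given motif ID. Peaks with no hits have an empty last
# column and are dropped.
# Usage: peaks_with_motif MOTIF_ID < answer.bed
peaks_with_motif() {
  awk -v motif="$1" 'BEGIN { FS = OFS = "\t" }
    {
      n = split($NF, ids, ";")           # IDs are assumed to be ";"-separated
      for (i = 1; i <= n; i++) {
        if (ids[i] == motif) { print; break }
      }
    }'
}

# Example usage (placeholder motif ID and file names):
#   peaks_with_motif MA0139.1 < answer.bed > peaks_with_top_motif.bed
```

Matching whole IDs after splitting, rather than grepping the line, avoids false positives when one ID is a prefix of another.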

Make sure the chromosome naming schemes match between the BED files before doing set operations. Your peaks might use Ensembl's scheme (1, 2, etc.), while the FIMO output in my example uses UCSC names (chr1, chr2, etc.). An awk preprocessing step can convert either file. It is also worth running sort-bed on your peaks to ensure they are sorted before any set operations.
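For the awk step, a minimal sketch might look like the following. The function names add_chr_prefix and strip_chr_prefix are illustrative, and the MT/chrM mapping for the mitochondrial chromosome is an assumption about your particular assemblies; unplaced scaffolds and alternate contigs usually need a proper lookup table instead.

```shell
# Hypothetical helpers for tab-delimited BED read from stdin.

# Ensembl-style -> UCSC-style names (1 -> chr1, MT -> chrM).
add_chr_prefix() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^(#|track|browser)/ { print; next }          # pass header lines through
       $1 == "MT"           { $1 = "chrM"; print; next }
       $1 !~ /^chr/         { $1 = "chr" $1 }        # leave already-prefixed names alone
       { print }'
}

# UCSC-style -> Ensembl-style names (chr1 -> 1, chrM -> MT).
strip_chr_prefix() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^(#|track|browser)/ { print; next }
       $1 == "chrM"         { $1 = "MT"; print; next }
       { sub(/^chr/, "", $1); print }'
}

# Example usage (file names are assumptions):
#   add_chr_prefix < peaks.ensembl.bed | sort-bed - > peaks.bed
```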