I have a BED file of enhancer sites that I'd like to run motif analysis on. I'm looking for core promoter elements (if any exist) such as the TATA box, Sp1 sites, Inr, etc.
I came across MEME, and while I admittedly haven't read the entirety of the manual (I'm working on it though!) I thought it would be a good idea to come here and ask for any common pitfalls for this type of analysis.
Specifically, I'm looking for advice to make this analysis statistically and biologically sound. Is the input to the MEME Suite my BED file of enhancer sites, or should I first convert the BED file to FASTA? Which of the MEME Suite tools should I be using if my enhancer sites range from no less than 20 bp to no more than 1000 bp? What is the difference between MEME's novel, ungapped motif identifier and GLAM2's novel, gapped motif identifier? Which one would be better suited to this type of analysis?
PWMs for canonical core promoter elements have already been published. For example, Ohler 2002 Genome Biology has several of those. In addition, Vo Ngoc 2017 Genes & Development recently refined the Inr element.
Unless you are looking for novel core promoter elements, I recommend you just use these published PWMs and set the P- or E-value cutoff yourself. Also, in Ohler 2002 Genome Biology all of the elements are about 12 nt long. You definitely need to trim them to 4 to 8 nt to keep just the positions with the most information in your analysis.
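To illustrate the trimming step, here is a rough Python sketch that picks the contiguous window of a PWM with the highest total information content. It assumes each column is a list of A, C, G, T probabilities and a uniform 0.25 background; the function names and the toy matrix are made up for the example and are not taken from Ohler 2002 or from any MEME Suite tool.

```python
import math


def information_content(column, background=0.25):
    """Information content (bits) of one PWM column of A, C, G, T probabilities.

    Assumes a uniform background; zero-probability entries contribute nothing.
    """
    return sum(p * math.log2(p / background) for p in column if p > 0)


def trim_pwm(pwm, width):
    """Return (start, columns) for the contiguous window of `width` columns
    with the highest summed information content."""
    if not 1 <= width <= len(pwm):
        raise ValueError("width must be between 1 and the PWM length")
    ic = [information_content(col) for col in pwm]
    starts = range(len(pwm) - width + 1)
    best_start = max(starts, key=lambda s: sum(ic[s:s + width]))
    return best_start, pwm[best_start:best_start + width]


# Hypothetical 6-column matrix (rows are positions, values are A, C, G, T).
toy_pwm = [
    [0.25, 0.25, 0.25, 0.25],  # uninformative flank
    [0.10, 0.10, 0.10, 0.70],
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.85],
    [0.85, 0.05, 0.05, 0.05],
    [0.30, 0.20, 0.20, 0.30],  # nearly uninformative flank
]

start, core = trim_pwm(toy_pwm, 4)
print(f"keep columns {start} to {start + len(core) - 1}")
```

With a real matrix you would choose the width (somewhere in the 4 to 8 range) by looking at the per-column information content, then scan your enhancer sequences with the trimmed matrix using whichever cutoff you settle on.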