I downloaded a genome file and a GFF file from the NCBI FTP site. While extracting promoter sequences upstream of my genes of interest, I found a few sequences that are basically just Ns. The window I set was 1000 bp, and when I tried a larger range such as 3 kb using bedops, the same N-filled sequences were still there. Is this common? Will it affect downstream analysis using MEME, TOMTOM, and enrichment? Should I keep them or remove them? If I should remove them, may I have some suggestions on how? Thank you very much.
MEME will treat N as any base (see their DNA alphabet table). If you can avoid including these N-filled regions in your input, you should get cleaner search results. Starting with good MEME matrices will help if you go on to run TOMTOM searches for matches with published TFs and then do enrichment calculations based on those matches.
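One way to drop the N-filled promoters before running MEME is to filter the extracted FASTA by its fraction of N characters. The following is only a minimal sketch in plain Python. The function names (`read_fasta`, `n_fraction`, `drop_n_rich`), the example records, and the 10% cutoff are all assumptions made for illustration; they are not part of MEME or bedops, and you should pick a cutoff that suits your data.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n_fraction(seq: str) -> float:
    """Fraction of positions that are N (case-insensitive); 0.0 if empty."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def drop_n_rich(
    records: Iterable[Tuple[str, str]], max_n_fraction: float = 0.1
) -> List[Tuple[str, str]]:
    """Keep records whose N fraction is at most max_n_fraction.

    The 0.1 default is an arbitrary, hypothetical cutoff.
    """
    return [(h, s) for h, s in records if n_fraction(s) <= max_n_fraction]


if __name__ == "__main__":
    # Made-up records standing in for extracted promoter sequences.
    example = [
        ">geneA_promoter", "ACGTACGTAC",
        ">geneB_promoter", "NNNNNNNNNN",
        ">geneC_promoter", "ACGTNACGTA",
    ]
    for header, seq in drop_n_rich(read_fasta(example)):
        print(f">{header}\n{seq}")
```

To use it on a real file, pass an open file handle to `read_fasta` in place of the example list and write the kept records to a new FASTA file for MEME.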