Publication detailing this version: Current Protocols in Bioinformatics
Original Weeder publication: Nucleic Acids Research - Webserver Issue
For other tools from this group see their MoD Tools web site.
I have been using Weeder as my "go to" tool for motif discovery (MD) since it was released in 2004. There have been a number of version 1 updates, but I am delighted to report the release of Weeder 2.0. I am not affiliated with the group, just a fan.
Weeder belongs to the motif enumeration family of MD tools, in which occurrences of motifs in the query sequences are counted and, in this case, compared to a pre-calculated set of genome-specific background motifs. This has the wonderful benefit that you do not have to construct a background set of sequences (no easy task). It was initially used to identify common motifs in defined promoter regions, but has evolved to handle first ChIP-chip and then ChIP-seq data.
This version has several notable features:
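To make the enumeration idea concrete, here is a minimal Python sketch. It is not Weeder's algorithm: Weeder tolerates mismatches and uses its own scoring, whereas this toy version counts exact k-mers only and divides the observed frequency by a background frequency. The function names and the `background_freq` dictionary (k-mer to expected frequency) are hypothetical, made up for illustration.

```python
from collections import Counter

DNA = set("ACGT")

def count_kmers(sequences, k):
    """Count every exact k-mer made up only of A, C, G and T."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= DNA:  # skip windows containing N or other symbols
                counts[kmer] += 1
    return counts

def enrichment(sequences, background_freq, k, pseudo=1e-6):
    """Observed / expected frequency for each k-mer seen in the input.

    background_freq is a hypothetical stand-in for a pre-calculated,
    genome-specific table; unseen k-mers fall back to a small pseudo value.
    """
    counts = count_kmers(sequences, k)
    total = sum(counts.values())
    return {
        kmer: (n / total) / background_freq.get(kmer, pseudo)
        for kmer, n in counts.items()
    }
```

The point of the sketch is that the background is a lookup table rather than a second set of sequences, which is why no background sequences need to be assembled for each analysis.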
- A single binary (executable) file
- A reduced set of input parameters (and assumptions to make about the motifs)
- A new heuristic for the analysis of large ChIP-seq data sets
- Redundant motif filtering
I would urge you to read the excellent Current Protocols article as it contains detailed information on the installation, usage and thinking behind the mechanics of Weeder. Plus there are details of two other useful tools: Pscan and PscanChIP (scanning with known motifs).
Weeder 2.0 additional information:
- The default setting is recommended for the motif redundancy filter (-sim)
- The ChIP-seq heuristic (-chipseq) by default scans for the occurrence of oligos only in the first 100 sequences, so users should order sequences by their significance. For large data sets, -top can be set so that the number of sequences interrogated equates to the top 10 to 20% of input sequences, as recommended by the author.
- An expectation maximisation (EM) step is included to help "clean up" the resulting motif matrices. In this version the number of EM steps can be increased, which can be useful for motifs with highly redundant stretches of sequence.
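The ordering and -top advice above can be sketched in a few lines of Python. This is only an illustration under assumptions: each record is a hypothetical (name, sequence, score) tuple in which a higher score means a more significant peak, and `suggest_top` is a made-up helper, not part of Weeder. It just does the 10 to 20% arithmetic, never going below the default of 100 or above the number of sequences available.

```python
def order_by_significance(records):
    """Sort (name, sequence, score) tuples, most significant first.

    Assumes a higher score means a more significant peak; flip the sort
    if your scores are p-values or similar.
    """
    return sorted(records, key=lambda r: r[2], reverse=True)

def suggest_top(n_sequences, fraction=0.1, minimum=100):
    """Number of leading sequences corresponding to the chosen fraction.

    fraction should sit between 0.1 and 0.2 to follow the 10 to 20% advice.
    """
    top = max(minimum, round(n_sequences * fraction))
    return min(top, n_sequences)  # cannot scan more sequences than exist

def to_fasta(records):
    """Write the ordered records out as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _score in records)
```

For example, with 5,000 input sequences `suggest_top(5000)` works out at 500, which would be the value to pass to -top after writing the sorted sequences to the input file.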
Motif discovery top tips:
- Use short (100-200 bp) sequences that can reasonably be expected to contain one or more enriched motifs. This is not generally an issue with transcription factor ChIP-seq derived sequences centred on the summit of binding regions, which are expected to contain a dominant motif and possibly secondary motifs.
- There is no need to mask repetitive sequence, as factors may legitimately bind it.
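As a rough illustration of both tips, the Python sketch below cuts a fixed-width window centred on a peak summit and keeps repeats in place. The function name and arguments are hypothetical, and it assumes the chromosome is already in memory as a string, with a 0-based summit position and soft-masked (lowercase) repeats, which it simply upper-cases instead of replacing with N.

```python
def summit_window(chrom_seq, summit, width=200):
    """Return up to `width` bases centred on a 0-based summit position.

    The window is clipped at the chromosome ends, so it can be shorter
    than `width` near either edge. Lowercase (soft-masked) bases are
    upper-cased rather than removed, so repeats stay in the analysis.
    """
    half = width // 2
    start = max(0, summit - half)
    end = min(len(chrom_seq), summit + half)
    return chrom_seq[start:end].upper()
```

A width of 100 to 200 keeps the input in the range suggested above; wider windows mostly add flanking sequence that is less likely to carry the motif.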