Question: Parameters That Determine Meme Speed And Homer Speed For Motif Discovery?
2
6.2 years ago by
user790
United States
user790 wrote:

I'm trying to use MEME for motif discovery on reads from a high-throughput sequencing experiment. If I run MEME on roughly 400k reads, it becomes unbearably slow, and even if I drop to ~10k sequences it is still pretty slow. I am using these parameters:

-dna -text -nmotifs 30 -maxsize 100000000 -maxw 15

The large -maxsize is required to make it run on so many sequences. I restricted the motif width to 15 (-maxw) and the number of motifs to discover to 30 (-nmotifs). Which of these make a large difference in speed? Knowing that will help me optimize. Is it at all possible to use MEME on millions of sequences? What is the practical upper bound?

I would also like to try HOMER, so if anyone has thoughts on HOMER's speed and the parameters that particularly affect it, I would like to know. Thanks.

meme sequence motif • 4.7k views
modified 5.3 years ago by Biostar ♦♦ 20 • written 6.2 years ago by user790
3
6.2 years ago by
Mikael Huss4.6k
Stockholm
Mikael Huss4.6k wrote:

You are not supposed to use MEME on this kind of scale (tens of thousands of sequences). It will work well for a small set of sequences, say promoter sequences extracted from a list of differentially expressed genes. The number of input sequences, and the length of the input sequences, are what mostly determine how long it will take to run.
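Since those two quantities drive the run time, it can help to measure them before launching anything. Here is a minimal sketch, assuming a plain FASTA input; the function name fasta_stats is made up for illustration and is not part of the MEME suite:

```shell
# fasta_stats FILE: print "<number of sequences> <total residues>" for a FASTA file.
# Hypothetical helper, not part of MEME; assumes headers start with ">".
fasta_stats() {
    awk '
        /^>/ { n++; next }                            # header line: one more sequence
        { gsub(/[ \t\r]/, ""); total += length($0) }  # sequence line: add its length
        END { printf "%d %d\n", n, total }
    ' "$1"
}
```

Calling `fasta_stats myReads.fa` prints the sequence count followed by the total number of characters, which are the two numbers to watch when deciding whether an input is small enough.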

DREME from the same group is more geared towards ChIP-seq scale data, that is, tens of thousands of regions. HOMER can also handle large sets of sequences.

But most importantly, you may have to rethink your approach a bit. You seem to be trying to do motif discovery on raw reads, but does that really make sense? If you have hundreds of reads coming from the same genomic locus, isn't it better to just feed that locus into a motif finding algorithm rather than the raw reads, which just risks overwhelming the software? But maybe you have your reasons for doing it that way.

4

Concretely, the running time of MEME grows as the square of the total number of characters of the sequence and the cube of the number of sequences. This makes running MEME on more than about 10,000 sequences impractical on commodity hardware. MEME-ChIP works around this by sampling sequences from the input set and running MEME on only the sampled sequences. DREME's running time grows roughly linearly with the number of characters in the sequence data, but it's limited to motifs of width 8 or less.
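To illustrate the sampling idea, here is a rough sketch that draws a fixed number of records from a FASTA file in one pass using reservoir sampling. It is not MEME-ChIP's own sampling code; the function name subsample_fasta and the default seed are assumptions made for the example:

```shell
# subsample_fasta FILE N [SEED]: print N randomly chosen records from FILE
# (all records if FILE holds fewer than N). Hypothetical helper, not MEME-ChIP code.
subsample_fasta() {
    awk -v n="$2" -v seed="${3:-42}" '
        BEGIN { srand(seed) }
        /^>/ {                                   # a header starts a new record
            seen++
            if (seen <= n) {
                slot = seen                      # fill the reservoir first
            } else {
                r = int(rand() * seen) + 1       # uniform integer in 1..seen
                slot = (r <= n) ? r : 0          # replace a kept record, or skip
            }
            if (slot) rec[slot] = $0
            next
        }
        slot { rec[slot] = rec[slot] "\n" $0 }   # append sequence lines to a kept record
        END {
            kept = (seen < n) ? seen : n
            for (i = 1; i <= kept; i++) print rec[i]
        }
    ' "$1"
}
```

For example, `subsample_fasta myReads.fa 1000 > myReads.sample.fa` would keep 1,000 records. The records come out in reservoir order rather than shuffled, which should not matter for motif discovery.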

modified 6.2 years ago • written 6.2 years ago by charlesegrant40
1

Many thanks for the detailed explanation!

written 6.2 years ago by Mikael Huss4.6k
3
6.2 years ago by
Ryan Dale4.8k
Bethesda, MD
Ryan Dale4.8k wrote:

MEME-ChIP runs a collection of the MEME suite programs on ChIP-seq-scale data. From the docs, it looks like the limit (for the server, at least) is a file size limit of 50 MB:

"MEME-ChIP is designed especially for discovering motifs in LARGE (50MB maximum) sets of short (around 500bp) DNA sequences centered on locations of interest such as those produced by ChIP-seq experiments."

(For more, see the related question "Recommended Tools For De Novo Motif Discovery In Vertebrate Genome Tsses?")
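A quick way to check an input against that 50 MB limit before uploading is sketched below. The function name is invented for the example, and reading "50MB" as 50 * 1024 * 1024 bytes is an assumption, since the docs quoted above do not say which definition the server uses:

```shell
# fits_meme_chip_limit FILE [LIMIT_BYTES]: succeed if FILE is no larger than the limit.
# Hypothetical helper; the default of 50 * 1024 * 1024 bytes is an assumed reading of "50MB".
fits_meme_chip_limit() {
    local limit="${2:-$((50 * 1024 * 1024))}"
    local size
    size=$(wc -c < "$1" | tr -d '[:space:]')   # file size in bytes
    [ "$size" -le "$limit" ]
}
```

Usage would look like `fits_meme_chip_limit regions.fa && echo "small enough"`, where regions.fa is a placeholder file name.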

2
6.2 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Following up on Mikael's answer, you may have reasons for doing what you're doing. If you have access to a computational cluster, you might look into compiling meme_p, a variant of meme that incorporates OpenMPI components to spread out the work on multiple nodes.

You might build it like so:

$ cd /home/foo/meme_4.9.0
$ ./configure \
    --prefix=/home/foo/meme_4.9.0 \
    --with-url="http://meme.nbcr.net/meme" \
    --enable-openmp \
    --enable-debug \
    --with-mpicc=/opt/openmpi-1.6.3/bin/mpicc \
    --enable-opt

You need to add the OpenMPI lib path to the LD_LIBRARY_PATH environment variable in your environment setup, e.g. LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi-1.6.3/lib. Your OpenMPI installation must also be present on or available to each cluster node.
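One way to do that in a shell startup file is sketched here. The helper name add_lib_path is made up for the example; the path is the one used above. Appending through a small function avoids adding the directory twice and avoids a leading empty entry when the variable starts out unset:

```shell
# add_lib_path DIR: append DIR to LD_LIBRARY_PATH unless it is already listed.
# Hypothetical helper for a shell startup file such as ~/.bashrc.
add_lib_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;   # already present, nothing to do
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1" ;;
    esac
    export LD_LIBRARY_PATH
}

add_lib_path /opt/openmpi-1.6.3/lib
```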

A Sun Grid Engine-based script called runall.cluster would fire off the search as follows, supposing your sys admin has set up a parallel environment called mpi_pe (for example) with at least 64 slots:

#!/bin/bash

#
# runall.cluster
#

#$ -N memeCluster64
#$ -S /bin/bash
#$ -pe mpi_pe 64
#$ -v -np=64
#$ -cwd
#$ -o "memeCluster64.out"
#$ -e "memeCluster64.err"
#$ -notify
#$ -V

time /opt/openmpi-1.6.3/bin/mpirun \
       -np 64 \
       /home/foo/meme_4.9.0/bin/meme_p \
           /home/foo/meme_4.9.0/data/myReads.fa \
           -oc /home/foo/meme_4.9.0/output/myReads.fa.meme \
           -dna \
           -text \
           -nmotifs 30 \
           -maxsize 100000000 \
           -maxw 15

To run it:

$ qsub ./runall.cluster

In our environment, testing showed immediate benefit with as few as 8 or 16 nodes, with diminishing returns after about 32-64 nodes. You could use GNU time to do the same runtime testing on your end, i.e., measuring execution time vs nodes on a small test sequence set, in order to find a "sweet spot" where your job will run faster without taking up too much of the cluster.
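A rough sketch of that kind of scaling test follows. It uses date +%s instead of GNU time so that it stays self-contained, which means one-second resolution only. The function run_job is a hypothetical stand-in that you would replace with your own mpirun call on a small test file, and on a scheduler such as Sun Grid Engine each process count would normally be submitted as its own job rather than looped over in one shell:

```shell
# run_job NP: hypothetical stand-in for the real command. Replace the body with
# your own call, e.g. mpirun -np "$1" followed by the meme_p command line shown above.
run_job() {
    sleep 1
}

# time_scaling NP...: run the job once per process count and print a
# tab-separated table of process count and elapsed wall-clock seconds.
time_scaling() {
    local np start end
    printf 'np\tseconds\n'
    for np in "$@"; do
        start=$(date +%s)
        run_job "$np" > /dev/null 2>&1
        end=$(date +%s)
        printf '%s\t%s\n' "$np" "$((end - start))"
    done
}
```

Running `time_scaling 8 16 32 64` gives one row per process count, which is enough to see where the elapsed time stops dropping.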

modified 5.2 years ago • written 6.2 years ago by Alex Reynolds28k