Please share the best tool for de novo motif discovery in a large dataset
9.2 years ago
seta ★ 1.9k

Hi everybody,

Could anybody please let me know what the best tool is for de novo motif discovery in a large dataset, say a 50 Mb sequence file with some sequences up to 2000 bp in length? Looking forward to hearing your suggestions.

genome rna-seq sequence • 3.5k views

Thanks, but I mean de novo motif discovery; I'm not looking for a particular motif. I found MEME to be the best for short sequences, and I'm looking for something similar for large sequencing datasets that contain fairly long sequences.


Have you tried running MEME/DREME on the command line using the available options?


Yes, but I got the error "dataset is too large", and it still doesn't work after changing "-maxsize". Besides, as far as I have read, MEME just randomly selects 600 bp from the input sequences and searches for motifs in the central 100 bp. So how can it work well for sequences of 2000 bp or more?
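To make the subsample-and-trim idea concrete, here is a minimal Python sketch of how a large FASTA set could be reduced before it is handed to a size-limited motif finder. It is not MEME's own code: the function names are hypothetical, and the defaults (600 sequences, 100 bp window) are illustrative values loosely based on the numbers quoted above, not verified tool settings.

```python
import random


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def central_window(seq, width=100):
    """Return the central `width` bases of seq, or seq itself if it is shorter."""
    if len(seq) <= width:
        return seq
    start = (len(seq) - width) // 2
    return seq[start:start + width]


def subsample_and_trim(records, max_seqs=600, width=100, seed=0):
    """Randomly keep at most max_seqs records, each cut to its central window.

    max_seqs and width are assumed, illustrative defaults.
    """
    rng = random.Random(seed)
    if len(records) > max_seqs:
        records = rng.sample(records, max_seqs)
    return [(header, central_window(seq, width)) for header, seq in records]
```

Anything outside the central window is discarded, which is exactly the concern for 2000 bp sequences: this kind of trimming only makes sense when the motif is expected near the middle of each sequence (for example, around a ChIP-seq peak summit).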

9.1 years ago
Felix Francis ▴ 600

The detection rate of any single motif prediction tool is poor, whether the dataset is small or large. The best approach is to combine several different tools to get more reliable results.

Some of the best-ranked ones are MEME, MotifSampler and Weeder (ref: Tompa et al., Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotechnology, 2005, 23(1):137-144).

Adjust the parameters for each of these tools to maximize true positives (based on any training data set). De novo prediction results can be very sensitive to these parameter settings.
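As a rough illustration of combining tools, the Python sketch below keeps only motifs reported by at least two tools. It assumes each tool's output has already been reduced to plain A/C/G/T consensus strings; the function names and the example predictions are hypothetical, and exact string matching is a deliberate simplification (a matrix-based comparison, such as RSAT's compare-matrices, would be more realistic).

```python
def reverse_complement(seq):
    """Reverse-complement an A/C/G/T string (IUPAC codes are not handled)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def canonical(motif):
    """Strand-independent key: the smaller of the motif and its reverse complement."""
    motif = motif.upper()
    return min(motif, reverse_complement(motif))


def motifs_supported_by(predictions, min_tools=2):
    """Map each canonical motif to the sorted list of tools that reported it.

    predictions: dict of tool name -> iterable of consensus strings.
    Only motifs reported by at least min_tools distinct tools are returned.
    """
    support = {}
    for tool, motifs in predictions.items():
        for motif in motifs:
            support.setdefault(canonical(motif), set()).add(tool)
    return {m: sorted(tools) for m, tools in support.items() if len(tools) >= min_tools}


# Made-up predictions, for illustration only.
example = {
    "MEME": ["TGACGTCA", "GGGCGG"],
    "Weeder": ["CCGCCC", "AAATTT"],
    "MotifSampler": ["TGACGTCA"],
}
shared = motifs_supported_by(example)
```

Here GGGCGG and CCGCCC count as the same motif because one is the reverse complement of the other, while AAATTT is dropped because only one tool reported it.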

9.1 years ago
macmath ▴ 170

RSAT (Regulatory Sequence Analysis Tools) comprises a wide collection of modular tools for the detection of cis-regulatory elements in genome sequences. Thirteen new programs have been added to the 30 described in the 2008 NAR Web Software Issue, including an automated sequence retrieval from EnsEMBL (retrieve-ensembl-seq), two novel motif discovery algorithms (oligo-diff and info-gibbs), a 100-times faster version of matrix-scan enabling the scanning of genome-scale sequence sets, and a series of facilities for random model generation and statistical evaluation (random-genome-fragments, random-motifs, random-sites, implant-sites, sequence-probability, permute-matrix). Our most recent work also focused on motif comparison (compare-matrices) and evaluation of motif quality (matrix-quality) by combining theoretical and empirical measures to assess the predictive capability of position-specific scoring matrices. To process large collections of peak sequences obtained from ChIP-seq or related technologies, RSAT provides a new program (peak-motifs) that combines several efficient motif discovery algorithms to predict transcription factor binding motifs, match them against motif databases and predict their binding sites. Availability (web site, stand-alone programs and SOAP/WSDL (Simple Object Access Protocol/Web Services Description Language) web services): http://rsat.ulb.ac.be/rsat/.

http://pedagogix-tagc.univ-mrs.fr/rsat/

9.2 years ago

perl -ne '/motif/ and print' file

awk '/motif/' file

...


Thanks, Jorge. Would you please let me know where to get these programs and give some more detail?


awk is a standard Unix utility (a separate program, not a bash builtin) that is available on virtually every Unix-like system. perl is usually installed by default on most systems; you might have to install it if it is missing on yours.


Thanks for your explanation. Yes, perl is installed. However, as I mentioned, I'm looking for a de novo motif discovery tool that can handle a large dataset with some long sequences, e.g. 2000 bp.


Yeah, I realized that when I read through your post again. I think HMM- or SVM-based tools might help, but I haven't used any, so I'm not of much use here, unfortunately.


Jorge, I think OP is looking for de novo motifs, so something HMM based might be more appropriate, no?


There are plenty of ways to do motif finding. My answer was just to point out that if the question is not well described, it may attract very simple answers such as perl/awk/grep/... If the input and the kind of motif are described, the answers can be more useful.


I guess OP edited the question once they realized that it wasn't clear enough. But yeah, one has to be more specific when seeking help.
