Question

Scoring The Results From Different Motif Discovery Software

2

12.4 years ago

Dhillonv10 ▴ 110

Hi all,

I'm trying to compare the results from different protein motif discovery tools such as FirePro, SLiMFinder and MEME. After running the same data through each tool I get some nice results, but the question then becomes: which tool did better? Note that all three find most of the motifs but miss some. I realize some scoring schemes have been implemented, such as bio-optimizer in tmod (toolbox of motif discovery); however, these tools are closed off in some sense: I can't use them to compare results from any algorithms other than the ones that come pre-built. Thanks!

motif scoring algorithm • 4.1k views

ADD COMMENT • link updated 10.1 years ago by Biostar 20 • written 12.4 years ago by Dhillonv10 ▴ 110

score 2 · Answer 1 · 2011-12-01

The answer to the question "which software did better?" depends very much on what you are hoping to achieve. Is this a benchmarking study, in which you know the motifs you are looking for, or is this a motif discovery project?

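If the former, then the standard method is some kind of sensitivity (TP/(TP+FN)) versus precision (TP/(TP+FP)) analysis. (TP/(TP+FP) is sometimes loosely called "specificity", but strictly speaking specificity is TN/(TN+FP).) Even here, though, "best" depends on the downstream application.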
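For a benchmarking run, the two ratios above (TP/(TP+FN) and TP/(TP+FP)) can be computed per tool along these lines. This is only a minimal sketch: it assumes predicted and known motifs can be compared as exact labels, and the motif names are hypothetical placeholders rather than real results.

```python
def benchmark_scores(predicted, known):
    """Return (TP/(TP+FN), TP/(TP+FP)) for one tool's predictions."""
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # known motifs the tool recovered
    fn = len(known - predicted)   # known motifs the tool missed
    fp = len(predicted - known)   # predictions that are not known motifs
    recovered = tp / (tp + fn) if (tp + fn) else 0.0
    correct = tp / (tp + fp) if (tp + fp) else 0.0
    return recovered, correct


# Hypothetical benchmark: three known motifs, four predictions from one tool.
known_motifs = {"motif_A", "motif_B", "motif_C"}
tool_predictions = {"motif_A", "motif_C", "motif_X", "motif_Y"}
sens, prec = benchmark_scores(tool_predictions, known_motifs)
print(f"TP/(TP+FN) = {sens:.2f}, TP/(TP+FP) = {prec:.2f}")
```

Running the same function over each tool's output gives one pair of numbers per tool, which can then be weighed against whatever the downstream application needs.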

The "best" performance really depends on what you want to do next. If you want high confidence in your results, you need a method with statistical significance. SLiMFinder, for example, has well bench-marked statistics that account for evolutionary biases etc., but it is pretty conservative as a result. If you are going to test a lot of motifs and are not too worried about the False Discovery Rate as long as the motif is in there somewhere, you probably just want to look at the top results from several methods. (Relax the cut-offs if you are doing this to make sure that they all return some motifs.) They all have their own biases and will perform better or worse on different kinds of data. Unless you know what the biases in your own data are and can explicitly pick a model that represents that knowledge (lucky you, if so!), it is not always going to be easy to judge. (Just make sure that any methods you use account for evolutionary relationships between your input proteins and/or that you have screened those out prior to analysis; otherwise these will dominate the results of some methods.)

The other alternative is to put results from one method through the statistical models of another. You could feed regular expressions from another predictor to SLiMFinder (the "slimcheck" function) or its related program, SLiMSearch, to see what statistical support they have. You can also change the statistical model to be based on enrichment versus a background dataset rather than the composition of your search dataset, although this inherently has problems with biases introduced by protein families that I do not think anyone has solved. (It can provide nice corroborative evidence, though.)

score 1 · Answer 2 · 2011-11-30

1

12.4 years ago

Maximilian Haeussler ★ 1.6k

Do you have a foreground gene set?

If yes: try to rank the motifs by their matches in the foreground versus the background. E.g. if the found motifs match 30% of your background genes but 50% of your target genes, is this a good result? The TAMO toolbox includes various scoring functions for this: http://fraenkel.mit.edu/TAMO/
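One common way to answer the "is this a good result?" question is a one-sided hypergeometric (Fisher's exact) test on the match counts. The sketch below is not taken from TAMO; it assumes the foreground and background sets are disjoint, and the counts are made up to mirror the 50% versus 30% example.

```python
from math import comb


def enrichment_p_value(fg_hits, fg_total, bg_hits, bg_total):
    """P(at least fg_hits matches in the foreground) if matches were
    spread at random over the pooled foreground + background sequences."""
    population = fg_total + bg_total
    matches = fg_hits + bg_hits
    max_hits = min(fg_total, matches)
    tail = sum(
        comb(matches, k) * comb(population - matches, fg_total - k)
        for k in range(fg_hits, max_hits + 1)
    )
    return tail / comb(population, fg_total)


# Hypothetical counts: motif matches 50 of 100 target sequences
# and 300 of 1000 background sequences.
p = enrichment_p_value(fg_hits=50, fg_total=100, bg_hits=300, bg_total=1000)
print(f"enrichment p-value: {p:.2e}")
```

Motifs from different tools can then be ranked by this p-value. It does not correct for multiple testing or for related sequences in the input, so it is best treated as a rough ranking score.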

ADD COMMENT • link 12.4 years ago by Maximilian Haeussler ★ 1.6k

1

Thanks for your answer @maximilianh. Unfortunately, TAMO is another DNA-sequence motif discovery method; I'm working with protein motifs such as the ones from the ELM database.

ADD REPLY • link 12.4 years ago by Dhillonv10 ▴ 110

0

Oh I see. I didn't realize that you're working with linear motifs. I think many approaches from DNA motifs translate very well to linear motifs. You don't need the TAMO toolbox. My main point was that your score should be related to some other functional annotation of the gene/protein where the motif was found. So you could calculate functional enrichment for your motif or for other protein domains. If your motif is functional, it should often co-occur with another functional domain, I assume?

ADD REPLY • link 12.4 years ago by Maximilian Haeussler ★ 1.6k

score 1 · Answer 3 · 2011-12-01

1

12.4 years ago

Larry_Parnell 16k

You could employ a simple scoring scheme where a point is earned if the motif is predicted by an algorithm/tool and zero if not. Add the scores and the highest total is deemed the most reliable prediction for that motif. This simple approach has been employed for risk scores (for a given disease phenotype) from a panel of genetic variants.

A slightly more sophisticated approach is to do the same, but weight tools/algorithms according to some underlying characteristics such as known accuracy in detecting true vs false instances of the motif. With four motif predictors (FirePro, SLiMFinder and MEME, plus ELM) there may not be much value gained. And it may be difficult to assign weights to the score of motif present and motif absent for each motif. Nonetheless, it could be worth trying.
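Both the simple and the weighted scheme can be sketched in a few lines. The tool names come from this thread, but the motif labels and the weight are hypothetical placeholders, not measured accuracies.

```python
def vote_scores(predictions, weights=None):
    """predictions maps tool name -> set of motifs that tool reported.
    Each tool adds its weight (default 1.0) to every motif it predicts."""
    weights = weights or {}
    scores = {}
    for tool, motifs in predictions.items():
        weight = weights.get(tool, 1.0)
        for motif in motifs:
            scores[motif] = scores.get(motif, 0.0) + weight
    return scores


# Hypothetical predictions from three tools.
predictions = {
    "FirePro": {"motif_A", "motif_B"},
    "SLiMFinder": {"motif_A"},
    "MEME": {"motif_A", "motif_C"},
}

unweighted = vote_scores(predictions)
# Assumed weight only: pretend SLiMFinder deserves double trust.
weighted = vote_scores(predictions, weights={"SLiMFinder": 2.0})

for motif, score in sorted(unweighted.items(), key=lambda kv: -kv[1]):
    print(motif, score, weighted[motif])
```

With equal weights this reduces to counting how many tools reported each motif; the weighted variant only changes the ranking when the tools disagree.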

ADD COMMENT • link 12.4 years ago by Larry_Parnell 16k

1

Overlapping positions and nearly the same motif: score them as equal; this is option 1. Or, option 2, break them into two motifs: one tool predicts S..K.TQT and another does not. But SlimFinder does not predict KxTQT, so zero points. Then you'd need to combine them in some way later, IMO. If I were doing this work, I would go with option 1. The presence of the serine is a refinement of the KxTQT motif, but it is the same motif in essence.
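Option 1 can be automated by checking whether one pattern is a refinement of the other. The sketch below is a simplification that only handles fixed residues and single-position wildcards (written as "." or lower-case "x"); character classes and variable-length gaps are not handled.

```python
def normalise(pattern):
    """Write wildcards uniformly as '.' (some tools print them as 'x')."""
    return pattern.replace("x", ".")


def is_refinement(specific, general):
    """True if `general` fits inside `specific` at some offset, i.e. every
    fixed residue of `general` lines up with the same residue in `specific`."""
    s, g = normalise(specific), normalise(general)
    for offset in range(len(s) - len(g) + 1):
        window = s[offset:offset + len(g)]
        if all(gc == "." or gc == sc for gc, sc in zip(g, window)):
            return True
    return False


def same_motif(a, b):
    """Treat two patterns as one motif if either refines the other."""
    return is_refinement(a, b) or is_refinement(b, a)


print(same_motif("S..K.TQT", "KxTQT"))
```

Under this rule S..K.TQT and KxTQT count as the same motif, so both tools earn the point before the scores are added up.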

ADD REPLY • link 12.4 years ago by Larry_Parnell 16k

0

Thanks for your answer, Larry. I've been thinking of simply adding scores and such, but how would one handle this case using your approach? I got this pattern from SlimFinder: S..K.TQT. Another tool, DiLiMOT, gives me: KxTQT. How does one score this?

ADD REPLY • link 12.4 years ago by Dhillonv10 ▴ 110