(I've rewritten the question with more detail, but left the original version alone). See the section starting with
I've been working on a method for binary classification of DNA sequences. When I started working on it again (this was an old project that I resurrected), I wrote a post here, Resurrecting Dna Motif Finding Project, that can serve as background.
In more detail, here is what the method does.
Given a family of DNA sequences, for example DNA sequence motifs, I try to predict whether other sequences belong to this family by measuring their similarity to the sequences in the family. My method fits a distribution to the sequences in the family, and then assigns a p-value to sequences not in the family.
In the data I'm using, the sequences are all of length 28. The sequences I am analyzing are here, specifically human 12 RSS and mouse 12 RSS.
Since a comparison to existing methods is always a good thing, I am wondering what are the standard methods to beat, if any? I'm not very familiar with the available methods/algorithms.
I am in the process of trying MEME. It does not seem to do exactly what I want, and I don't know if I will be able to persuade it to. Specifically, I'm not sure if I can tell it that the sequences of length 28 are the motif. I got the impression from the documentation that it decides by itself what the motifs are.
I can give further details if necessary.
Ok. here is a more detailed version of the same question.
I've developed a method for motif sequence search, and I'm trying to find a method to compare it with, because reviewers like to see how your method compares with what is out there. However, I am having some difficulty in finding such a method. To be clear, this is not a de novo motif discovery method, but is related. Here are more details about what I have done.
I've analyzed two RSS data sets (the human 12 RSS and mouse 12 RSS mentioned above), each of which is a collection of RSS sequences. The fasta file for mouse 12 RSS is at http://www.itb.cnr.it/rss/stats/MM12RSS.fasta.
The main purpose of the analysis is to predict whether sequences not in this family belong to the family. So, I used a cross-validation method. I divided each data set into 5 parts, and used each choice of 4 of the 5 parts as a training set in turn. (The number 5 here is a bit arbitrary, but since I wanted to include the results per training set, I didn't want the number to be too large.) After fitting a model to the training set, I then used this model for prediction as follows.
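Before getting to the prediction step, here is a minimal Python sketch of the kind of 5-way split described above. It is an illustration only, not the code actually used: the function name `five_fold_splits`, the shuffling, and the fixed seed are all assumptions.

```python
import random

def five_fold_splits(sequences, k=5, seed=0):
    """Yield (training, held_out) pairs, one pair per fold.

    Hypothetical helper: the shuffle and the seed are illustrative choices.
    """
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled sequences into k roughly equal parts.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        # The training set is the other k - 1 parts combined.
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield training, held_out
```

Each of the 5 training sets then gets its own fitted model, and the matching held-out part supplies the RSS that the model has not seen.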
The RSS in each data set are contained in gene segments, typically one or two RSS per gene segment. The gene segments are often much larger than the RSS. These are 12RSS, so each RSS is of length 28. I took all the gene segments I could find that contained an RSS, and selected from them all contiguous sequences of length 28. The current total number of these sequences is 449905 for one data set, and 624400 for the other. The corresponding numbers of RSS are 118 and 201. Note that the sequences in these sets are not necessarily all distinct.
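The window selection above can be sketched in a few lines of Python. This is only a sketch under the assumption that each gene segment is available as a plain string; the function name `windows` is made up for illustration, and only the given strand is scanned.

```python
def windows(segment, width=28):
    """Return (start_offset, subsequence) for every contiguous window of the given width.

    A segment shorter than the width yields an empty list.
    """
    return [(i, segment[i:i + width]) for i in range(len(segment) - width + 1)]
```

A segment of length n contributes n - 27 windows of length 28, which is why a few hundred gene segments produce several hundred thousand candidate sequences.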
I then used the model derived from the training set to calculate p-values for all of these approximately 500,000 sequences, omitting the RSS sequences that were in the training set. (I'm leaving out some details here, but I don't think it is important exactly how I calculated the values.)
Then I ranked the sequences in order of decreasing p-value. The hope was that the remaining RSS sequences would rank highly in this ranking, and in the event they did.
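The ranking step can be sketched as below, assuming the p-values are held in a dict keyed by sequence (so duplicate sequences collapse to one entry). The function name `ranks_of_held_out` and this data layout are assumptions for illustration, not the actual code; any comparison method that produces one score per window could be evaluated the same way.

```python
def ranks_of_held_out(pvalues, held_out):
    """Return the 1-based rank of each held-out sequence.

    pvalues: dict mapping sequence -> p-value (hypothetical layout).
    Sequences are ordered by decreasing p-value, so rank 1 is the largest.
    """
    ordered = sorted(pvalues, key=pvalues.get, reverse=True)
    rank = {seq: i + 1 for i, seq in enumerate(ordered)}
    # Held-out sequences that were never scored are skipped.
    return {seq: rank[seq] for seq in held_out if seq in rank}
```

If the held-out RSS mostly receive small ranks out of the roughly 500,000 candidates, the model is separating them from the background windows.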
Now, I'd like to find an algorithm which is already implemented in software, which can perform a similar procedure in a reasonable amount of time, so I can compare the results. Please let me know if you know of any such thing. Thanks.