At a software engineering meetup, I met two people working in bioinformatics who told me about a problem they have: matching a given RNA sequence against a database of RNA sequences. As I understood it, given an RNA sequence, they want to find similar RNA sequences.
I did not realize it at the time, but I have, as far as I know, discovered a novel algorithm that, given a string, searches a database for similar strings. It works really well with a limited vocabulary. I tested it on spell checking and achieved a success rate between 77% and 94%, depending on how I turn two knobs: a) the number of candidates and b) the maximum Levenshtein distance.
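To make the two knobs concrete, here is a minimal Python sketch of the final filtering step only. It is not my actual algorithm: the step that fetches candidates from the database is not shown, and the names `filter_candidates`, `max_candidates` and `max_distance` are made up for this illustration. It assumes the candidates arrive as a list already ordered by relevance.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def filter_candidates(query, candidates, max_candidates=10, max_distance=2):
    """Hypothetical helper: apply both knobs to an already-fetched candidate list.

    Knob a) max_candidates: only the first max_candidates entries are scored.
    Knob b) max_distance: entries farther than this from the query are dropped.
    """
    scored = [(levenshtein(query, c), c) for c in candidates[:max_candidates]]
    kept = [(d, c) for d, c in scored if d <= max_distance]
    return [c for _, c in sorted(kept)]  # closest first


if __name__ == "__main__":
    # Toy RNA-like strings, invented for the example.
    print(filter_candidates("ACGU", ["ACGU", "ACGA", "UUUU", "ACG"],
                            max_candidates=3, max_distance=1))
```

Raising `max_candidates` or `max_distance` lets more matches through at the cost of more distance computations, which is the trade-off behind the range of success rates above.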
Best of all, it works on databases that are bigger than memory (it is also fast :)
Anyway, I would like to try that algorithm on RNA sequence matching.
Do you know of any dataset that would help me with that task?