Question

Dataset to help benchmark approximate string matching algorithm?

0

Entering edit mode

3.6 years ago

amirouche.boubekki ▴ 10

At a meetup about software engineering, I met two persons working in bio-informatics that told me about a problem they have about matching a given RNA sequence to a database of RNA sequences. My understanding of their problem was basically, that given a RNA sequence they want to find out similar RNA sequence.

I did not figure it at the moment but I have, as far as i know, discovered a novel algorithm that allows given a string to search for similar string in a database. It works really well with a limited vocabulary. I tested it to do spell checking I achieve between 77% and 94% success rate depending on how I turn the knobs a) number of candidates b) max levenshtein distance.

The best of all, is that it works on bigger than memory database (also it is fast :)

Anyway, I would to try that algorithm for rna sequence matching.

Do you know any dataset that will help me for that task?

dataset rna-seq • 1.3k views

ADD COMMENT • link updated 3.6 years ago by Mensur Dlakic ★ 27k • written 3.6 years ago by amirouche.boubekki ▴ 10

1

Entering edit mode

You developed a new alignment method? Be sure to extensively check the literature, there are so many alignment tools out there that it is at least of limited likelihood that you discovered something ground-breakingly new. Is it better than BLAST for low-throughput and local alignments? Is it better than hisat2 for short queries, or minimap2 for longer queries when mapping sequences onto a reference?

ADD REPLY • link 3.6 years ago by ATpoint 81k

0

Entering edit mode

Part of the problem is that I do not know the correct vocabulary to build my search queries, I will look around the software you mentioned. Thanks for the input. (Yes, I know it is very unlikely, but no discovery was made without prior art)

ADD REPLY • link 3.6 years ago by amirouche.boubekki ▴ 10

0

Entering edit mode

Can you describe your method a bit? What makes it unique to e.g. BLAST?

ADD REPLY • link 3.6 years ago by ATpoint 81k

0

Entering edit mode

I am still not sure how BLAST work. The idea is to hash the subject sequence a lookup nearest match in the database, and then score a subset of the nearest sequences using distance metric, it rely on one-hot-encoding of the vocabulary and Merkle tree, see https://hyper.dev/blog/fuzzbuzz-hash-algorithm.html

ADD REPLY • link 3.6 years ago by amirouche.boubekki ▴ 10

0

Entering edit mode

Does it make sense? What do you think?

ADD REPLY • link 3.6 years ago by amirouche.boubekki ▴ 10

score 1 · Answer 1 · 2020-09-23

I have a saying that kmer based "matching" methods eventually replace every single bioinformatics approach. It just takes time.

As for your method, work with your collaborators, find other methods, and compare it to those. Even if the method is not completely novel, identifying a certain practical uses for your algorithm can be of high utility.

Once I have rediscovered Gaussian kernel smoothing and used it for peak detection, I did not know what it was called, I knew that cannot possibly be novel, just that I had no idea what it is called. Still, to this day I am proud of my "discovery". At that time there were no gaussian kernel smoother peak detectors "on the market", so we were able to find features no one else could.

With my collaborators we ended up wroting a bunch of papers, not on the method, but the discoveries that we made with it.

score 1 · Answer 2 · 2020-09-23

There are numerous hashing algorithms to compare biological sequences. I am listing only several that I have tried, so this is not an exhaustive list.

A simple GitHub search will give you plenty of methods to compare to yours. There is also a review article that may be helpful.