I perform CRISPR screens in mammalian cells and also analyze screen data (using linux, bowtie2, samtools, and R).

Here is my problem: Imagine a long list of 20-mer DNA sequences (or sgRNA sequences) where each sequence occurs exactly one time. In spite of each sequence being unique, there could still be a higher degree of similarity between many of them. What is the best algorithm to assess the level of similarity in such a list ?

To simplify the problem (and solution), let's fist look only at sequence differences with a single base mismatch, and/or a single gap. That should help to expand to higher divergences. I am not really interested in anything that goes beond 2 mismatches or gaps.

I was thinking about a matrix-style output. Could probably come up with some code myself, but a really fast solution is needed here, since I'm dealing with lists of up to a million 20-mer sequences. I was thinking about dendrogram algorithms but that idea didn't get me very far.

Hi jrj.healey,

thanks for your response ! I set up an implementation of the Levenshtein algorithm that works but speed is still a problem. However, I feel there is room for improvement because I don't need precise information beyond Levenshtein distances of 2.

On Wikipedia I found a matrix-based algorithm (based on the Wagner–Fischer algorithm) that I implemented it in C++. Matrix values are filled in in progressing squares from the top left corner. Since I am only interested in Levenshtein distances of 0, 1, and 2, I should gain speed by thresholding the calculations accordingly.

But I don't see any speedup - calculating the full matrix (for a test set of hundred 20-mers) is about the same as calculating only part of the matrix. The reason must lie in the implementation. I set up a Linux bash script that feeds 20-mers into the Levenshtein function. Am I better offf writing everything in C++ ? Or is there another programming language that is particularly suited for this ?

I’m no expert on algorithm optimisation.

If you were clever about partitioning your pairwise comparisons you might be able to brute force the process by simply parallelising the function. Perhaps that would be enough?

How have you implemented your Levenshtein function? In C++?

See: https://stackoverflow.com/questions/16278874/how-to-speed-up-levenshtein-distance-calculation

Please use

`ADD COMMENT/ADD REPLY`

when responding to existing posts to keep threads logically organized.This should have been a comment under @healey's answer.

What is the underlying problem? What is the source of the kmer list? How the lengthier description of the kmer question relates to CRISPR screen?