Question: 20-mer DNA sequence list - assess similarity
gravatar for meraner
6 months ago by
meraner20 wrote:

I perform CRISPR screens in mammalian cells and also analyze screen data (using linux, bowtie2, samtools, and R).

Here is my problem: Imagine a long list of 20-mer DNA sequences (or sgRNA sequences) where each sequence occurs exactly one time. In spite of each sequence being unique, there could still be a higher degree of similarity between many of them. What is the best algorithm to assess the level of similarity in such a list ?

To simplify the problem (and solution), let's fist look only at sequence differences with a single base mismatch, and/or a single gap. That should help to expand to higher divergences. I am not really interested in anything that goes beond 2 mismatches or gaps.

I was thinking about a matrix-style output. Could probably come up with some code myself, but a really fast solution is needed here, since I'm dealing with lists of up to a million 20-mer sequences. I was thinking about dendrogram algorithms but that idea didn't get me very far.

sequencing sequence R • 320 views
ADD COMMENTlink modified 5 months ago by Damian Kao15k • written 6 months ago by meraner20

Hi jrj.healey,

thanks for your response ! I set up an implementation of the Levenshtein algorithm that works but speed is still a problem. However, I feel there is room for improvement because I don't need precise information beyond Levenshtein distances of 2.

On Wikipedia I found a matrix-based algorithm (based on the Wagner–Fischer algorithm) that I implemented it in C++. Matrix values are filled in in progressing squares from the top left corner. Since I am only interested in Levenshtein distances of 0, 1, and 2, I should gain speed by thresholding the calculations accordingly.

But I don't see any speedup - calculating the full matrix (for a test set of hundred 20-mers) is about the same as calculating only part of the matrix. The reason must lie in the implementation. I set up a Linux bash script that feeds 20-mers into the Levenshtein function. Am I better offf writing everything in C++ ? Or is there another programming language that is particularly suited for this ?

ADD REPLYlink written 5 months ago by meraner20

I’m no expert on algorithm optimisation.

If you were clever about partitioning your pairwise comparisons you might be able to brute force the process by simply parallelising the function. Perhaps that would be enough?

How have you implemented your Levenshtein function? In C++?


ADD REPLYlink modified 5 months ago • written 5 months ago by jrj.healey11k

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This should have been a comment under @healey's answer.

ADD REPLYlink written 5 months ago by genomax63k

What is the underlying problem? What is the source of the kmer list? How the lengthier description of the kmer question relates to CRISPR screen?

ADD REPLYlink written 6 months ago by h.mon24k
gravatar for jrj.healey
6 months ago by
United Kingdom
jrj.healey11k wrote:

Use CD-HIT and cluster all your sequences at whatever threshold you desire. It will group the similar sequences together, and also give you a representative sequence for all clusters.

You could hand-code something using Levenshtein or Hamming distances, they’re famously not very fast however, as string comparisons rarely are.

ADD COMMENTlink modified 6 months ago • written 6 months ago by jrj.healey11k
gravatar for Damian Kao
5 months ago by
Damian Kao15k
Damian Kao15k wrote:

Starcode is for UMI clustering, but it should also work for your purposes:

ADD COMMENTlink written 5 months ago by Damian Kao15k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1995 users visited in the last hour