Question: 20-mer DNA sequence list - assess similarity
1
12 months ago by
meraner20
meraner20 wrote:

I perform CRISPR screens in mammalian cells and also analyze screen data (using linux, bowtie2, samtools, and R).

Here is my problem: Imagine a long list of 20-mer DNA sequences (or sgRNA sequences) where each sequence occurs exactly one time. In spite of each sequence being unique, there could still be a higher degree of similarity between many of them. What is the best algorithm to assess the level of similarity in such a list ?

To simplify the problem (and solution), let's fist look only at sequence differences with a single base mismatch, and/or a single gap. That should help to expand to higher divergences. I am not really interested in anything that goes beond 2 mismatches or gaps.

I was thinking about a matrix-style output. Could probably come up with some code myself, but a really fast solution is needed here, since I'm dealing with lists of up to a million 20-mer sequences. I was thinking about dendrogram algorithms but that idea didn't get me very far.

sequencing sequence R • 405 views
modified 11 months ago by Damian Kao15k • written 12 months ago by meraner20
1

Hi jrj.healey,

thanks for your response ! I set up an implementation of the Levenshtein algorithm that works but speed is still a problem. However, I feel there is room for improvement because I don't need precise information beyond Levenshtein distances of 2.

On Wikipedia I found a matrix-based algorithm (based on the Wagner–Fischer algorithm) that I implemented it in C++. Matrix values are filled in in progressing squares from the top left corner. Since I am only interested in Levenshtein distances of 0, 1, and 2, I should gain speed by thresholding the calculations accordingly.

But I don't see any speedup - calculating the full matrix (for a test set of hundred 20-mers) is about the same as calculating only part of the matrix. The reason must lie in the implementation. I set up a Linux bash script that feeds 20-mers into the Levenshtein function. Am I better offf writing everything in C++ ? Or is there another programming language that is particularly suited for this ?

1

I’m no expert on algorithm optimisation.

If you were clever about partitioning your pairwise comparisons you might be able to brute force the process by simply parallelising the function. Perhaps that would be enough?

How have you implemented your Levenshtein function? In C++?

ADD REPLYlink modified 11 months ago • written 11 months ago by Joe14k

Please use `ADD COMMENT/ADD REPLY` when responding to existing posts to keep threads logically organized.

This should have been a comment under @healey's answer.

What is the underlying problem? What is the source of the kmer list? How the lengthier description of the kmer question relates to CRISPR screen?

1
12 months ago by
Joe14k
United Kingdom
Joe14k wrote:

Use CD-HIT and cluster all your sequences at whatever threshold you desire. It will group the similar sequences together, and also give you a representative sequence for all clusters.

You could hand-code something using Levenshtein or Hamming distances, they’re famously not very fast however, as string comparisons rarely are.

0
11 months ago by
Damian Kao15k
USA
Damian Kao15k wrote:

Starcode is for UMI clustering, but it should also work for your purposes:

https://github.com/gui11aume/starcode