Question

20-mer DNA sequence list - assess similarity

1

Entering edit mode

5.7 years ago

meraner ▴ 20

I perform CRISPR screens in mammalian cells and also analyze screen data (using linux, bowtie2, samtools, and R).

Here is my problem: Imagine a long list of 20-mer DNA sequences (or sgRNA sequences) where each sequence occurs exactly one time. In spite of each sequence being unique, there could still be a higher degree of similarity between many of them. What is the best algorithm to assess the level of similarity in such a list ?

To simplify the problem (and solution), let's fist look only at sequence differences with a single base mismatch, and/or a single gap. That should help to expand to higher divergences. I am not really interested in anything that goes beond 2 mismatches or gaps.

I was thinking about a matrix-style output. Could probably come up with some code myself, but a really fast solution is needed here, since I'm dealing with lists of up to a million 20-mer sequences. I was thinking about dendrogram algorithms but that idea didn't get me very far.

sequence R sequencing • 1.8k views

ADD COMMENT • link updated 5.6 years ago by Damian Kao 16k • written 5.7 years ago by meraner ▴ 20

1

Entering edit mode

Hi jrj.healey,

thanks for your response ! I set up an implementation of the Levenshtein algorithm that works but speed is still a problem. However, I feel there is room for improvement because I don't need precise information beyond Levenshtein distances of 2.

On Wikipedia I found a matrix-based algorithm (based on the Wagner–Fischer algorithm) that I implemented it in C++. Matrix values are filled in in progressing squares from the top left corner. Since I am only interested in Levenshtein distances of 0, 1, and 2, I should gain speed by thresholding the calculations accordingly.

But I don't see any speedup - calculating the full matrix (for a test set of hundred 20-mers) is about the same as calculating only part of the matrix. The reason must lie in the implementation. I set up a Linux bash script that feeds 20-mers into the Levenshtein function. Am I better offf writing everything in C++ ? Or is there another programming language that is particularly suited for this ?

ADD REPLY • link 5.6 years ago by meraner ▴ 20

1

Entering edit mode

I’m no expert on algorithm optimisation.

If you were clever about partitioning your pairwise comparisons you might be able to brute force the process by simply parallelising the function. Perhaps that would be enough?

How have you implemented your Levenshtein function? In C++?

See: https://stackoverflow.com/questions/16278874/how-to-speed-up-levenshtein-distance-calculation

ADD REPLY • link 5.6 years ago by Joe 21k

0

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This should have been a comment under @healey's answer.

ADD REPLY • link 5.6 years ago by GenoMax 141k

0

Entering edit mode

What is the underlying problem? What is the source of the kmer list? How the lengthier description of the kmer question relates to CRISPR screen?

ADD REPLY • link 5.7 years ago by h.mon 35k

score 1 · Answer 1 · 2018-08-26

Use CD-HIT and cluster all your sequences at whatever threshold you desire. It will group the similar sequences together, and also give you a representative sequence for all clusters.

You could hand-code something using Levenshtein or Hamming distances, they’re famously not very fast however, as string comparisons rarely are.

score 0 · Answer 2 · 2018-09-26

0

Entering edit mode

5.6 years ago

Damian Kao 16k

Starcode is for UMI clustering, but it should also work for your purposes:

https://github.com/gui11aume/starcode

ADD COMMENT • link 5.6 years ago by Damian Kao 16k