20-mer DNA sequence list - assess similarity
2
1
Entering edit mode
5.7 years ago
meraner ▴ 20

I perform CRISPR screens in mammalian cells and also analyze screen data (using linux, bowtie2, samtools, and R).

Here is my problem: Imagine a long list of 20-mer DNA sequences (or sgRNA sequences) where each sequence occurs exactly one time. In spite of each sequence being unique, there could still be a higher degree of similarity between many of them. What is the best algorithm to assess the level of similarity in such a list ?

To simplify the problem (and solution), let's fist look only at sequence differences with a single base mismatch, and/or a single gap. That should help to expand to higher divergences. I am not really interested in anything that goes beond 2 mismatches or gaps.

I was thinking about a matrix-style output. Could probably come up with some code myself, but a really fast solution is needed here, since I'm dealing with lists of up to a million 20-mer sequences. I was thinking about dendrogram algorithms but that idea didn't get me very far.

sequence R sequencing • 1.8k views
ADD COMMENT
1
Entering edit mode

Hi jrj.healey,

thanks for your response ! I set up an implementation of the Levenshtein algorithm that works but speed is still a problem. However, I feel there is room for improvement because I don't need precise information beyond Levenshtein distances of 2.

On Wikipedia I found a matrix-based algorithm (based on the Wagner–Fischer algorithm) that I implemented it in C++. Matrix values are filled in in progressing squares from the top left corner. Since I am only interested in Levenshtein distances of 0, 1, and 2, I should gain speed by thresholding the calculations accordingly.

But I don't see any speedup - calculating the full matrix (for a test set of hundred 20-mers) is about the same as calculating only part of the matrix. The reason must lie in the implementation. I set up a Linux bash script that feeds 20-mers into the Levenshtein function. Am I better offf writing everything in C++ ? Or is there another programming language that is particularly suited for this ?

ADD REPLY
1
Entering edit mode

I’m no expert on algorithm optimisation.

If you were clever about partitioning your pairwise comparisons you might be able to brute force the process by simply parallelising the function. Perhaps that would be enough?

How have you implemented your Levenshtein function? In C++?

See: https://stackoverflow.com/questions/16278874/how-to-speed-up-levenshtein-distance-calculation

ADD REPLY
0
Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This should have been a comment under @healey's answer.

ADD REPLY
0
Entering edit mode

What is the underlying problem? What is the source of the kmer list? How the lengthier description of the kmer question relates to CRISPR screen?

ADD REPLY
1
Entering edit mode
5.7 years ago
Joe 21k

Use CD-HIT and cluster all your sequences at whatever threshold you desire. It will group the similar sequences together, and also give you a representative sequence for all clusters.

You could hand-code something using Levenshtein or Hamming distances, they’re famously not very fast however, as string comparisons rarely are.

ADD COMMENT
0
Entering edit mode
5.6 years ago

Starcode is for UMI clustering, but it should also work for your purposes:

https://github.com/gui11aume/starcode

ADD COMMENT

Login before adding your answer.

Traffic: 2942 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6