I perform CRISPR screens in mammalian cells and also analyze screen data (using linux, bowtie2, samtools, and R).
Here is my problem: Imagine a long list of 20-mer DNA sequences (or sgRNA sequences) where each sequence occurs exactly one time. In spite of each sequence being unique, there could still be a higher degree of similarity between many of them. What is the best algorithm to assess the level of similarity in such a list ?
To simplify the problem (and solution), let's fist look only at sequence differences with a single base mismatch, and/or a single gap. That should help to expand to higher divergences. I am not really interested in anything that goes beond 2 mismatches or gaps.
I was thinking about a matrix-style output. Could probably come up with some code myself, but a really fast solution is needed here, since I'm dealing with lists of up to a million 20-mer sequences. I was thinking about dendrogram algorithms but that idea didn't get me very far.