kmer alignment with mismatch
5.7 years ago
Vince

Hi,

I have a list of kmers, 8-12 nt in length, and I would like to align them to a larger sequence, returning all ungapped matches with at most 2 mismatches. I would like the search to be exhaustive, i.e. I do not want to miss anything. I wrote a Python script that computes the Hamming distance between each query and every substring of my reference, but it is too slow for many (1000s of) queries on a reference of ~100,000 nt.

What program would you recommend that does this and runs reasonably quickly? I have looked into Bowtie2, but I am unsure whether it was designed to work with such short query sequences.

Thanks for the feedback.

5.7 years ago
Vince

I ended up writing a custom Python script that:

1. Breaks the larger sequence into kmers of the required sizes and stores them in a set.
2. For each query kmer, computes all possible kmers with up to 2 mismatches and stores them in a set.
3. Intersects the set from step 1 with the set from step 2.

Works extremely quickly as my kmers are quite small (8-12 nt) and the search target is also relatively small (tens of kb).
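
For anyone who wants to try the same idea, here is a minimal sketch of the three steps above. It is not the original script: the function names (`neighbors`, `index_reference`, `find_matches`) are made up for illustration, and it assumes an uppercase ACGT alphabet, forward strand only, and 0-based positions. It keeps a position index instead of a plain set so that the match locations can be reported.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: uppercase DNA, no N or IUPAC codes


def neighbors(kmer, max_mismatches=2):
    """Return every string within Hamming distance max_mismatches of kmer."""
    result = {kmer}
    for d in range(1, max_mismatches + 1):
        # Choose which d positions to mutate.
        for positions in combinations(range(len(kmer)), d):
            # At each chosen position, try every base except the original.
            choices = [[b for b in ALPHABET if b != kmer[p]] for p in positions]
            for subs in product(*choices):
                chars = list(kmer)
                for p, b in zip(positions, subs):
                    chars[p] = b
                result.add("".join(chars))
    return result


def index_reference(reference, k):
    """Map each length-k substring of reference to its 0-based start positions."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def find_matches(reference, queries, max_mismatches=2):
    """Return sorted (query, position, matched_substring) tuples."""
    indexes = {}  # one reference index per query length
    hits = []
    for q in queries:
        k = len(q)
        if k not in indexes:
            indexes[k] = index_reference(reference, k)
        idx = indexes[k]
        # Keep only the mutated kmers that actually occur in the reference.
        for variant in neighbors(q, max_mismatches):
            for pos in idx.get(variant, []):
                hits.append((q, pos, variant))
    return sorted(hits)


if __name__ == "__main__":
    for hit in find_matches("ACGTACGTTT", ["ACGA"]):
        print(hit)
```

A 12-mer has 1 + 12*3 + 66*9 = 631 variants within 2 mismatches, so each query costs a few hundred dictionary lookups regardless of the reference length, which is why this scales well for short kmers. Reverse-complement matches would need the queries (or the reference) to be reverse-complemented separately.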

Your solution sounds like the best approach. Incidentally, you can generate mutant kmers with BBDuk, like this:

bbduk.sh ref=sequences.fasta dump=kmers.fasta k=12 hdist=2 mm=f


The number of allowed mutations is specified with hdist (Hamming distance).

5.7 years ago

Good that you found a solution. For this sort of thing, vmatch is a useful program to know about. It has extensive documentation, and it seems to me to be really well written and maintained. You need to ask for a license key (free for academic use).

In your case the command would probably be:

vmatch -v -e 2 -d -p -showdesc 0 -complete -q query.fa reference.mkvtree


with reference.mkvtree being the indexed reference.