Question

kmer alignment with mismatch

3

8.2 years ago

Vince ▴ 150

Hi,

I have a list of kmers, 8-12 nt in length, and I would like to align them to a larger sequence, returning all ungapped matches with at most 2 mismatches. I would like the search to be exhaustive, i.e. I do not want to miss any hits. I wrote a Python script that computes the Hamming distance between the query and every substring of my reference, but it is too slow for many (1000s of) queries on a reference of ~100,000 nt.
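For context, a brute-force scan of the kind described above might look roughly like the sketch below. The function names are illustrative, not the original script. It does one Hamming comparison per reference window per query, which is why it scales poorly with thousands of queries, but it is handy as a correctness check for any faster method.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def brute_force_hits(reference, query, max_mismatches=2):
    """Return (position, window, distance) for every ungapped window of the
    reference that is within max_mismatches of the query."""
    k = len(query)
    hits = []
    for i in range(len(reference) - k + 1):
        window = reference[i:i + k]
        d = hamming(window, query)
        if d <= max_mismatches:
            hits.append((i, window, d))
    return hits
```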

What program would you recommend that does this and runs reasonably quickly? I have looked into Bowtie2, but I am unsure whether it was designed to work with such short query sequences.

Thanks for the feedback.

kmer alignment mismatch • 4.4k views

Updated 8.2 years ago by dariober 14k • written 8.2 years ago by Vince ▴ 150

2

8.2 years ago

dariober 14k

Good that you found a solution. For this sort of thing, vmatch is a useful program to know about. It has extensive documentation, and it seems to me to be really well written and maintained. You need to ask for a license key (free for academic use).

In your case the command would probably be:

vmatch -v -e 2 -d -p -showdesc 0 -complete -q query.fa reference.mkvtree

with reference.mkvtree being the indexed reference.

Updated 4.3 years ago by Ram 43k • written 8.2 years ago by dariober 14k

Ram · Accepted Answer · 2016-02-05

3

8.2 years ago

Vince ▴ 150

I ended up writing a custom Python script that:

1. Breaks the larger sequence into kmers of the required sizes and stores them in a set.
2. For each query kmer, computes all possible kmers with up to 2 mismatches and stores them in a set.
3. Intersects the set from step 1 with the set from step 2.

Works extremely quickly as my kmers are quite small (8-12 nt) and the search target is also relatively small (tens of kb).
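The steps above can be sketched as follows. This is a minimal illustration, not the original script: the function names are hypothetical, it assumes uppercase A/C/G/T only, and it uses a dict from kmer to positions instead of a bare set so that hit locations can be reported.

```python
from collections import defaultdict
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: no N or other ambiguity codes


def index_reference(reference, k):
    """Step 1: map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def neighbors(kmer, max_mismatches=2):
    """Step 2: all k-mers within max_mismatches substitutions of kmer."""
    result = {kmer}
    for n in range(1, max_mismatches + 1):
        for positions in combinations(range(len(kmer)), n):
            # At each chosen position, try every base except the original.
            choices = [[b for b in ALPHABET if b != kmer[p]] for p in positions]
            for subs in product(*choices):
                variant = list(kmer)
                for p, b in zip(positions, subs):
                    variant[p] = b
                result.add("".join(variant))
    return result


def find_hits(reference, queries, max_mismatches=2):
    """Step 3: intersect each query's neighborhood with the reference k-mers.

    Returns {query: sorted list of (position, matched reference k-mer)}.
    """
    indexes = {}  # one reference index per query length, built lazily
    hits = {}
    for q in queries:
        k = len(q)
        if k not in indexes:
            indexes[k] = index_reference(reference, k)
        index = indexes[k]
        found = []
        for variant in index.keys() & neighbors(q, max_mismatches):
            found.extend((pos, variant) for pos in index[variant])
        hits[q] = sorted(found)
    return hits
```

For a 12-mer with up to 2 mismatches the neighborhood holds 1 + 12×3 + 66×9 = 631 kmers, so each query costs a few hundred hash lookups regardless of reference length.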

Updated 4.3 years ago by Ram 43k • written 8.2 years ago by Vince ▴ 150

1

Your solution sounds like the best approach. Incidentally, you can generate mutant kmers with BBDuk, like this:

bbduk.sh ref=sequences.fasta dump=kmers.fasta k=12 hdist=2 mm=f

The maximum number of mutations is set with hdist (Hamming distance).

Updated 4.3 years ago by Ram 43k • written 8.2 years ago by Brian Bushnell 20k