kmer alignment with mismatch
8.2 years ago
Vince ▴ 150

Hi,

I have a list of kmers, 8-12 nt in length, and I would like to align these to a larger sequence, returning all ungapped matches with at most 2 mismatches. I would like the search to be exhaustive, i.e. I do not want to miss anything. I wrote a Python script to compute the Hamming distance between the query and every substring of my reference, but it is too slow for many (1000s of) queries on a reference of ~100,000 nt.

What program would you recommend that does this and runs reasonably quickly? I have looked into Bowtie2, but I am unsure whether it was designed to work with such short query sequences.

Thanks for the feedback.

kmer alignment mismatch • 4.4k views
8.2 years ago
Vince ▴ 150

Ended up creating a custom Python script that:

  1. Breaks the larger sequence into kmers of the required sizes and stores them in a set.
  2. For each query kmer, computes all possible kmers with up to 2 mismatches and stores them in a set.
  3. Intersects the set from 1 with the set from 2.

Works extremely quickly as my kmers are quite small (8-12 nt) and the search target is also relatively small (tens of kb).
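
For anyone landing here later, the three steps above could be sketched roughly as follows. This is not the original script, only a minimal illustration: the function names (neighbors, index_reference, find_matches) are made up for this example, and it assumes uppercase DNA over the alphabet ACGT. It also stores start positions for each reference kmer instead of a bare set, so that hits can be reported with coordinates.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: uppercase DNA only


def neighbors(kmer, max_mismatches=2):
    """Return every string within max_mismatches substitutions of kmer."""
    result = {kmer}
    for d in range(1, max_mismatches + 1):
        for positions in combinations(range(len(kmer)), d):
            # At each chosen position, try every base except the original one.
            choices = [[b for b in ALPHABET if b != kmer[p]] for p in positions]
            for replacement in product(*choices):
                mutant = list(kmer)
                for p, b in zip(positions, replacement):
                    mutant[p] = b
                result.add("".join(mutant))
    return result


def index_reference(reference, k):
    """Map each length-k substring of reference to its start positions."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def find_matches(reference, queries, max_mismatches=2):
    """Return sorted (query, position, matched_substring) tuples for all ungapped hits."""
    indexes = {}  # one reference index per query length, built on demand
    hits = []
    for q in queries:
        k = len(q)
        if k not in indexes:
            indexes[k] = index_reference(reference, k)
        idx = indexes[k]
        # Step 3: intersect the reference kmers with the query's mismatch neighborhood.
        for mutant in idx.keys() & neighbors(q, max_mismatches):
            for pos in idx[mutant]:
                hits.append((q, pos, mutant))
    return sorted(hits)


if __name__ == "__main__":
    for hit in find_matches("AAAACCCC", ["AAAA"]):
        print(hit)
```

A 12-mer has 1 + 12*3 + 66*9 = 631 neighbors within 2 substitutions, so the per-query work stays small and does not depend on the reference length, which is why this beats scanning every reference position for every query.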


Your solution sounds like the best approach. Incidentally, you can generate mutant kmers with BBDuk, like this:

bbduk.sh ref=sequences.fasta dump=kmers.fasta k=12 hdist=2 mm=f

The maximum number of substitutions is specified with hdist (Hamming distance).

8.2 years ago

Good that you found a solution. For this sort of thing, vmatch is a useful program to know about. It has extensive documentation, and it seems to me to be really well written and maintained. You need to ask for a license key (free for academic use).

In your case the command would probably be:

vmatch -v -e 2 -d -p -showdesc 0 -complete -q query.fa reference.mkvtree

with reference.mkvtree being the indexed reference.
