Generating All K-Mers Having N Mismatch Positions To A Kmer Using C/C++
1
2
Entering edit mode
12.0 years ago

For any k and n (positive integer values ), there are A=k C n possible k-mers having a mismatch of n to a kmer T. I am not sure how to generate those A number of kmers? I am thinking to store all possible length K kmers in a suffix tree and use it to find all k-mers having n mismatches to a query kmer A. or Just simply store all possible Kmers in an array and do an exhaustive comparisons. Are there better ways, using C/C++ language for 6 <= k <= 12?

Thanks.

• 4.7k views
ADD COMMENT
0
Entering edit mode

Enumerating all the combinations is a very classical problem in computer science. You can find the solution to this generalized problem with google. You can do that with suffix tree, but it is probably much easier to solve the generalized problem.

ADD REPLY
1
Entering edit mode
12.0 years ago
SES 8.6k

Instead of trying to implement your own solution, I recommend you try Vmatch, which is a powerful and versatile tool for these kinds of matching tasks. The general steps involved would be to 1) create an index of one set of sequences with the program mkvtree (i.e., make the virtual suffix trees) and then 2) use the program vmatch with the option "-identity" to control the match behavior. I'm not 100% clear on what you are trying to do so I can't go into more details, but I am pretty certain that if you go through the documentation (95 pages worth) for Vmatch you will be able to find a solution. Also, keep in mind that if you are enumerating all k-mers up to a certain mismatch level from a large sequence set then you may be trying a very, very large number of matches.

ADD COMMENT
1
Entering edit mode

The problem I wanted to use on that solution is as following:

I have counted the frequencies of k-mers from a large set of DNA sequences stored in a file, then given a k-mer L I would like to find the frequency of E(L, n), i.e., set of all k-mers having n-mismatch to L, using that pre-computed frequency file.

So I am thinking to store those pre-computed k-mers frequencies in a hash-table, and then enumerate E(L,n) and then find the frequency of each T in E(L,n) using the hash table.

The vmatch indeed can solve this problem.

ADD REPLY

Login before adding your answer.

Traffic: 2701 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6