Question: Is There A Fast Hashing Function For Nucleotide K-Mers (Q-Grams)?
17
gravatar for Ketil
6.0 years ago by
Ketil3.8k
Germany
Ketil3.8k wrote:

I would like to index all k-mers in a set of nucleotide sequences. I could use a generic string based hash function, but my experiments indicate that a function leveraging the fact that the k-mers are overlapping might be faster. Also, I'd like a k-mer and its reverse complement to hash to the same value.

To be precise, I am not looking for hash table implementations, or applications, just a fast function from k-mer to a hash value, so that the value for a k-mer and its reverse complement is the same.

I can invent my own function for this (I have, as a matter of fact), but surely, something like this exists already?

index • 7.4k views
ADD COMMENTlink modified 6.0 years ago by Micans260 • written 6.0 years ago by Ketil3.8k

have you any solution for this? what's the hashing functions for DNA k-mers?? Thanks

ADD REPLYlink written 12 months ago by midox130
10
gravatar for Brad Chapman
6.0 years ago by
Brad Chapman9.1k
Boston, MA
Brad Chapman9.1k wrote:

Titus Brown's khmer is C++ wrapped in Python:

https://github.com/ctb/khmer

and is used for attacking next-gen assembly challenges:

http://ivory.idyll.org/blog/jul-10/kmer-filtering

http://www.slideshare.net/c.titus.brown/climbing-mt-metagenome

http://www.slideshare.net/c.titus.brown/pycon-2011-talk-ngram-assembly-with-bloom-filters

ADD COMMENTlink written 6.0 years ago by Brad Chapman9.1k

Fun to see bloom filters gaining some use! This is in fact what I want the hashing function for (although it will be rehashed by some other function later).

ADD REPLYlink written 6.0 years ago by Ketil3.8k

khmer uses darcs as its RCS. Yay!

ADD REPLYlink written 5.9 years ago by Ketil3.8k

FYI: The correct url for the khmer project is https://github.com/ged-lab/khmer

ADD REPLYlink written 3.4 years ago by Michael R. Crusoe400
5
gravatar for Micans
6.0 years ago by
Micans260
Micans260 wrote:

I have used the simple map from A,C,G,T onto {00, 01, 10, 11} (in bits), allowing four bases encoded in a byte. You could then use hash(word) + hash(rc(word)), or any other symmetric function. This approach has the advantage that you can use the current hash to find the next hash; as you shift bases in and out of your k-mer, you shift bits in and out of your hash or hash constituents. This is the approach taken in sylamer, and it's pretty fast.

ADD COMMENTlink written 6.0 years ago by Micans260

I used this approach recently in my final year project and my lecturer was left speechless at the speed/accuracy of my hash function. Not sure why this isn't a standard!

ADD REPLYlink written 6.0 years ago by Mark Anthony Gibbins130

If you want identical hashes for reverse-complements, couldn't you just use 0 for A/T and 1 for G/C, instead of using two bits per nucleotide?

ADD REPLYlink written 6.0 years ago by Ryan Thompson3.2k

Yes, I've used this too. The main problems with this is that you risk overflowing a fixed size data type (so k must not be greater than bits/2). For large k, you need to use an arbitrary precision integer, which slows things down. Also, the incremental calculation of next hash involves two divisions and some other arithmetic. In short - I think it's possible to do better.

ADD REPLYlink written 5.9 years ago by Ketil3.8k

Yes, those limitations are evident. In the context of Sylamer we do not support K-mers of length > 16 currently, but on 64-bit machines you can support words of lenght 32 natively. For large k I do not use all bits and resort to true hashing (the size of the hash array defined by the amount of sequence to be analysed). This solution obviously fits some applications and not others, but in terms of speed I will be well impressed if you can do substantially better than 'hid = ((hid << 2) | b) & LOMEGA(K)' (no divisions).

ADD REPLYlink written 5.9 years ago by Micans260

No, that's pretty good. I though that to support simultaneous calculation of the reverse complement hash you'd need a division. But now I think you can do that too with shifts and bitwise ops. Good catch!

ADD REPLYlink written 5.9 years ago by Ketil3.8k

I did an implementation of this in Haskell - normally, I'd expect a high level language to be less efficient for this, but it turns out it is fast enough (meaning that I haven't found an associative data structure that won't be dramatically slower than the hashing). I can hash about 40MB/s on my laptop, doing both forward and reverse complement, and avoiding Ns. Almost twice as fast without those restrictions.

Oh, and I map A: 00, C: 11, G: 01, T:10 - this makes it easy (bitwise operations) to complement and count GC content, and differentiate pyrimidines and purines.

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Ketil3.8k

I guess this is a variation of rolling hash? Can you give more specific description on implementation? Thanks!

ADD REPLYlink written 2.8 years ago by Griffan60

Still here? :-) Yes, I maintain the current hash in forward and reverse complement, shift left/right and chop off the end, add next base, and store the numerically smallest hash. Code at https://github.com/ketil-malde/kmx.

ADD REPLYlink written 11 weeks ago by Ketil3.8k
2
gravatar for Drio
6.0 years ago by
Drio890
United States
Drio890 wrote:

Take a look to these benchmark results. Particularly the last benchmark, dict. It compares different hashing libraries in different programming languages.

If you want speed, use khash. That's the hash table implementation used in the c entry for that benchmark.

Also check this post. It has a very nice review of C/C++ hash table libraries.

ADD COMMENTlink written 6.0 years ago by Drio890

Thanks - but I don't really want a hash table implementation, I just what a hashing function.

ADD REPLYlink written 6.0 years ago by Ketil3.8k
0
gravatar for Jts
6.0 years ago by
Jts1.1k
Jts1.1k wrote:

I have used one of Bob Jenkin's hash functions (lookup3/8) for k-mer hashing with good results:

http://burtleburtle.net/bob/hash/

I have also heard good things about murmer hash but I have not tried it myself:

http://code.google.com/p/smhasher/

As for dedicated k-mer hashes see this paper by Guillaume Marçais and Carl Kingsford:

http://bioinformatics.oxfordjournals.org/content/27/6/764

In section 3.4 they discuss their design of a k-mer specific hashing function.

ADD COMMENTlink written 6.0 years ago by Jts1.1k

What hash compression technique do you use for the hash values returned by lookup3/8? How do you determine the hashtable size?

Thanks in advance.

ADD REPLYlink written 4 weeks ago by germelcar10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 488 users visited in the last hour