Question: hash function for DNA sequences
1
midox260 wrote:

hello, i need a good hash function for DNA sequences. i cant found in web. can you propose me a good and efficient hash function??

Thank you

dna hashing hash • 6.4k views
written 4.7 years ago by midox260
4

What for, what is your type of application? What's relevant for efficiency for you? Computational complexity, time, memory, cryptographic quality, expected number of collisions? What is your key size, input size? Specific programming languages?

2

I want to store in a hash table a k-mers. I want a good hash function in time and memory. I do not know yet the number of colisions. I program in C. thanks.

2

If you just want to store kmers in a table, don't hash them. Just encode them to a number A → 0, C → 1, … T → 3 and weight each digit accordingly.

no, i need to hash them because i want to compare sequences using the hash table. for this, i need a good hash function.

3

The trivial hash is also a perfect one, it won't have collisions, so you can compare sequences. Also it is normally also cheapest to compute. If you only want to encode ACGT (no ambiguity), each base can be encoded as 2-bit. That means that 32 bits could address all 16 mers, and a 64-bit integer k = 32

maybe i choose my k=8. can you explain me what's a trivial hash?? thanks

3

Take for instance the kmer K=GTCGAATC (size 8)

If you take by convention A=0, C=1, G=2, T=3, you can encode K as a polynomial in base 4 like this:

``````K = G.4^0 + T.4^1 + C.4^2 + G.4^3 + A.4^4 + A.4^5 + T.4^6 + C.4^7
K = 2*4^0 + 3*4^1 + 1*4^2 + 2*4^3 + 0*4^4 + 0*4^5 + 3*4^6 + 1*4^7
K = 28830
``````

You can thus encode any 8-mer as a unique integer value.

ok thank you for your explanations. if i have collisions, what do I have in this case? because in my sequences i have ~15% of errors. Thanks

2

With this technique the hash is bijective and thus no collisions occur for fixed k.

ok.if i have collisions? what do I have in this case? because error cause collisions!

2

I think this is a misunderstanding. Sequencing errors do not cause hash collisions. A collision in the sense of the topic would be like h(s): S -> H , there exists s, v in S, s != v AND h(s) = h(v)

A seq. error causes a sequence to contain a 'wrong' k-mer (ACgT) while the 'true' sequence is (ACAT). You have no chance to differentiate the two by means of a hashing function.

have you some papers for hash that use nucleotides ??

Look for the khmer citation.

i see the citation but i not found!! have you another?? Thanks

what do u mean by 'i not found'? the paper is there, if u need more try the other thread or any of the papers in the references.

1

If you truly need a hash then I've found murmur3 to be quick and easy to use in C.

i dont undestand the fonctionnality of Murmurhash. can you simplify me ?? THanks

1

It's a generic hash function, so you can support arbitrarily long kmers (or anything else, for that matter).

ok thanks. do i need to undestand the algorithm? because in https://en.wikipedia.org/wiki/MurmurHash I did not understand the algorithm or the code! can you explain me if you can ?

2

Don't bother trying to understand the hash function, just use it if you need it. This has good performance and is fairly widely used. I've used it in conjunction with the 2-bit encoding that Michael and the others have mentioned (i.e., encoding an A as 0, C as 1, and so on) and been happy with the results. Certainly if you can get away with just using the 2-bit encoded value then do it, you're unlikely to get better performance than that.

OK thankyou. Just a few questions: what is the length of k-mers that can support MurMur ?? and can i use it on nucleotides ?? thanks for your responses.

1

You can use any length of anything, it's a generic hashing function.

in the focntion of MurMurHash there are (key, len, seed) the Key is the string , len is the lenght and what's seed? Thanks

2

Seed is a random value that's used in the hashing. I just use `uint32_t seed = 0xAAAAAAAA;`.

if i use machine 64 what's the random value can i use?

2

There's no difference between 32/64 bit machines with this, you can use the same value (e.g., 0xAAAAAAAA).

1

On a 64bit machine, the most random value is 0x8000000000000000, plus or minus 0x8000000000000000. My personal fav is 0x0123456789ABCDEF, but dont tell anyone because every time some uses it, it becomes less random.

1

We need to add `<sarcasm>` tags to biostars.

Hahaha, yes :) Or perhaps just a catch-all `<bad advice>` tag that I could encapsulate all my posts in :P

it's logic to have a negative values by murmur?? because I have positive values and negative values !!!

1

Presumably you need to recast to an unsigned integer. My code for this is:

``````uint64_t hashString(char *s) {
int len = strlen(s);
uint64_t hash_val;
uint32_t seed = 0xAAAAAAAA;
#if UINTPTR_MAX == 0xffffffff
MurmurHash3_x86_128((void *) s, len, seed, (void *) &hash_val);
#else
MurmurHash3_x64_128((void *) s, len, seed, (void *) &hash_val);
#endif
return hash_val;
}
``````

no i use machine 32 bit. i use the code in wikipedia website

Then use a `uint32_t` rather than an `int32_t`.

yes in my function i used uint32_t but i have negative value. I think these are a false results???

1

I guess you are printing you results wrong, because the `u` in `uint32_t` means `unsigned` i.e. no negative values.

That's not even a theoretical possibility. Presumably you're converting that to a signed value at some point.

no, my value is like this in C code: uint32_t Value=0; Value=murmur('ACTGCAC',7,0xAAAAAAAA); printf("%d",Value); and here the result is negative value!!!

1

`printf("%"PRIu32, Value);`

You're converting an unsigned 32-bit integer to a signed `int` in what you're doing...

1

what's printf("%"PRIu32, Value); it doesn't work in C!

See here and here.

I'm somewhat amazed that you got things compiled without importing `inttypes.h`...

no we juste use "%u" for unseigned. thanks

Hello Devon.

Could you please share what the hash compression function do you use in order to assign and index for the hash value returned by murmur? What size of the hash table do you use?

1

So, fixed length input, and a hash table, I conclude you should optimize for lookup time/hash compute time, that would be the trivial hash algorithm, means you won't need any complex hashing function. Convert the k-mer into a 32-bit integer (if it fits) value and use the number to address an array...

1

The up2bit format converts every sequence of DNA into a unique number, much like the 2bit format suggested by kloetzi, except the up2bit format costs 2 bits more, and allows you to store DNA of a variable length. It is stable over different struct sizes/types, (stuffing it into a signed or unsigned int wont matter, 'ATCG' has the same value in int16 as it does in int128, etc), and since its just bit shifting its faster than polynomial math.

However, its not a one-way hashing function since it can be reversed, so perhaps you were looking for another one of the many things Michael points.

code in python:

``````def dna_up2bit(DNA):
up2bit = 1
for twobit in [ ('A','C','T','G').index(char) for char in reversed(DNA) ]:
up2bit = (up2bit << 2) + twobit
return up2bit
``````
1

See MinHash

1
1

That thread points out that if you want strand-unspecific hashing, make A/T 0 and C/G 1. I think thats pretty awesome :D Dump twice as much DNA in each hash.

Are the generic functions (e.g., murmur hash) not sufficing for you?

Hello, I have a problem the hash function Murmur (). if I take k = 3 (lenght), the hash function normally gives values between 0 and 255. but I have values in millions and billions. Does this make sense?

1

If you have more specific problems about implementation of this algorithm I would recommend to move over to stackexchange. However, this is a good hint as to why the use of a real hash is overkill, there are 64 unique 3-mers, why use the long integer to store them? The hash function is for arbitrary length input, so it doesn't take the k-mer length into account.

Sure that makes sense, the values should be distributed over the full range of possible integers.

Of course if everything is of length 3 then just do the 2-bit trick that everyone else mentioned. That'll be faster. It's when the length is too long or you have variable sized inputs that you need a generic hash function.

1

The up2bit code I pasted above does variable-length inputs :( I think the only time you need a generic hash-function when the OP asks for a generic hash function :P