hash function for DNA sequences
0
1
Entering edit mode
6.1 years ago
midox ▴ 290

hello, i need a good hash function for DNA sequences. i cant found in web. can you propose me a good and efficient hash function??

Thank you

hash hashing DNA • 8.7k views
4
Entering edit mode

What for, what is your type of application? What's relevant for efficiency for you? Computational complexity, time, memory, cryptographic quality, expected number of collisions? What is your key size, input size? Specific programming languages?

2
Entering edit mode

I want to store in a hash table a k-mers. I want a good hash function in time and memory. I do not know yet the number of colisions. I program in C. thanks.

2
Entering edit mode

If you just want to store kmers in a table, don't hash them. Just encode them to a number A → 0, C → 1, … T → 3 and weight each digit accordingly.

0
Entering edit mode

no, i need to hash them because i want to compare sequences using the hash table. for this, i need a good hash function.

3
Entering edit mode

The trivial hash is also a perfect one, it won't have collisions, so you can compare sequences. Also it is normally also cheapest to compute. If you only want to encode ACGT (no ambiguity), each base can be encoded as 2-bit. That means that 32 bits could address all 16 mers, and a 64-bit integer k = 32

0
Entering edit mode

maybe i choose my k=8. can you explain me what's a trivial hash?? thanks

3
Entering edit mode

Take for instance the kmer K=GTCGAATC (size 8)

If you take by convention A=0, C=1, G=2, T=3, you can encode K as a polynomial in base 4 like this:

K = G.4^0 + T.4^1 + C.4^2 + G.4^3 + A.4^4 + A.4^5 + T.4^6 + C.4^7
K = 2*4^0 + 3*4^1 + 1*4^2 + 2*4^3 + 0*4^4 + 0*4^5 + 3*4^6 + 1*4^7
K = 28830


You can thus encode any 8-mer as a unique integer value.

0
Entering edit mode

ok thank you for your explanations. if i have collisions, what do I have in this case? because in my sequences i have ~15% of errors. Thanks

2
Entering edit mode

With this technique the hash is bijective and thus no collisions occur for fixed k.

0
Entering edit mode

ok.if i have collisions? what do I have in this case? because error cause collisions!

2
Entering edit mode

I think this is a misunderstanding. Sequencing errors do not cause hash collisions. A collision in the sense of the topic would be like h(s): S -> H , there exists s, v in S, s != v AND h(s) = h(v)

A seq. error causes a sequence to contain a 'wrong' k-mer (ACgT) while the 'true' sequence is (ACAT). You have no chance to differentiate the two by means of a hashing function.

0
Entering edit mode

have you some papers for hash that use nucleotides ??

0
Entering edit mode

Look for the khmer citation.

0
Entering edit mode

i see the citation but i not found!! have you another?? Thanks

0
Entering edit mode

what do u mean by 'i not found'? the paper is there, if u need more try the other thread or any of the papers in the references.

1
Entering edit mode

If you truly need a hash then I've found murmur3 to be quick and easy to use in C.

0
Entering edit mode

i dont undestand the fonctionnality of Murmurhash. can you simplify me ?? THanks

1
Entering edit mode

It's a generic hash function, so you can support arbitrarily long kmers (or anything else, for that matter).

0
Entering edit mode

ok thanks. do i need to undestand the algorithm? because in https://en.wikipedia.org/wiki/MurmurHash I did not understand the algorithm or the code! can you explain me if you can ?

2
Entering edit mode

Don't bother trying to understand the hash function, just use it if you need it. This has good performance and is fairly widely used. I've used it in conjunction with the 2-bit encoding that Michael and the others have mentioned (i.e., encoding an A as 0, C as 1, and so on) and been happy with the results. Certainly if you can get away with just using the 2-bit encoded value then do it, you're unlikely to get better performance than that.

0
Entering edit mode

OK thankyou. Just a few questions: what is the length of k-mers that can support MurMur ?? and can i use it on nucleotides ?? thanks for your responses.

1
Entering edit mode

You can use any length of anything, it's a generic hashing function.

0
Entering edit mode

in the focntion of MurMurHash there are (key, len, seed) the Key is the string , len is the lenght and what's seed? Thanks

2
Entering edit mode

Seed is a random value that's used in the hashing. I just use uint32_t seed = 0xAAAAAAAA;.

0
Entering edit mode

if i use machine 64 what's the random value can i use?

2
Entering edit mode

There's no difference between 32/64 bit machines with this, you can use the same value (e.g., 0xAAAAAAAA).

1
Entering edit mode

On a 64bit machine, the most random value is 0x8000000000000000, plus or minus 0x8000000000000000. My personal fav is 0x0123456789ABCDEF, but dont tell anyone because every time some uses it, it becomes less random.

1
Entering edit mode

We need to add <sarcasm> tags to biostars.

0
Entering edit mode

Hahaha, yes :) Or perhaps just a catch-all <bad advice> tag that I could encapsulate all my posts in :P

0
Entering edit mode

it's logic to have a negative values by murmur?? because I have positive values and negative values !!!

1
Entering edit mode

Presumably you need to recast to an unsigned integer. My code for this is:

uint64_t hashString(char *s) {
int len = strlen(s);
uint64_t hash_val[2];
uint32_t seed = 0xAAAAAAAA;
#if UINTPTR_MAX == 0xffffffff
MurmurHash3_x86_128((void *) s, len, seed, (void *) &hash_val);
#else
MurmurHash3_x64_128((void *) s, len, seed, (void *) &hash_val);
#endif
return hash_val[0];
}

0
Entering edit mode

no i use machine 32 bit. i use the code in wikipedia website

0
Entering edit mode

Then use a uint32_t rather than an int32_t.

0
Entering edit mode

yes in my function i used uint32_t but i have negative value. I think these are a false results???

1
Entering edit mode

I guess you are printing you results wrong, because the u in uint32_t means unsigned i.e. no negative values.

0
Entering edit mode

That's not even a theoretical possibility. Presumably you're converting that to a signed value at some point.

0
Entering edit mode

no, my value is like this in C code: uint32_t Value=0; Value=murmur('ACTGCAC',7,0xAAAAAAAA); printf("%d",Value); and here the result is negative value!!!

1
Entering edit mode

printf("%"PRIu32, Value);

You're converting an unsigned 32-bit integer to a signed int in what you're doing...

1
Entering edit mode

0
Entering edit mode

what's printf("%"PRIu32, Value); it doesn't work in C!

0
Entering edit mode

See here and here.

0
Entering edit mode

I'm somewhat amazed that you got things compiled without importing inttypes.h...

0
Entering edit mode

no we juste use "%u" for unseigned. thanks

0
Entering edit mode

Hello Devon.

Could you please share what the hash compression function do you use in order to assign and index for the hash value returned by murmur? What size of the hash table do you use?

1
Entering edit mode

So, fixed length input, and a hash table, I conclude you should optimize for lookup time/hash compute time, that would be the trivial hash algorithm, means you won't need any complex hashing function. Convert the k-mer into a 32-bit integer (if it fits) value and use the number to address an array...

1
Entering edit mode

The up2bit format converts every sequence of DNA into a unique number, much like the 2bit format suggested by kloetzi, except the up2bit format costs 2 bits more, and allows you to store DNA of a variable length. It is stable over different struct sizes/types, (stuffing it into a signed or unsigned int wont matter, 'ATCG' has the same value in int16 as it does in int128, etc), and since its just bit shifting its faster than polynomial math.

However, its not a one-way hashing function since it can be reversed, so perhaps you were looking for another one of the many things Michael points.

code in python:

def dna_up2bit(DNA):
up2bit = 1
for twobit in [ ('A','C','T','G').index(char) for char in reversed(DNA) ]:
up2bit = (up2bit << 2) + twobit
return up2bit

1
Entering edit mode

See MinHash

1
Entering edit mode
1
Entering edit mode

That thread points out that if you want strand-unspecific hashing, make A/T 0 and C/G 1. I think thats pretty awesome :D Dump twice as much DNA in each hash.

0
Entering edit mode

Are the generic functions (e.g., murmur hash) not sufficing for you?

0
Entering edit mode

Hello, I have a problem the hash function Murmur (). if I take k = 3 (lenght), the hash function normally gives values between 0 and 255. but I have values in millions and billions. Does this make sense?

1
Entering edit mode

If you have more specific problems about implementation of this algorithm I would recommend to move over to stackexchange. However, this is a good hint as to why the use of a real hash is overkill, there are 64 unique 3-mers, why use the long integer to store them? The hash function is for arbitrary length input, so it doesn't take the k-mer length into account.

0
Entering edit mode

Sure that makes sense, the values should be distributed over the full range of possible integers.

Of course if everything is of length 3 then just do the 2-bit trick that everyone else mentioned. That'll be faster. It's when the length is too long or you have variable sized inputs that you need a generic hash function.

1
Entering edit mode

The up2bit code I pasted above does variable-length inputs :( I think the only time you need a generic hash-function when the OP asks for a generic hash function :P

0
Entering edit mode

So what's the solution for DNA k-mer if a i have a small k. do i change the hash function???