Question: hash function for DNA sequences
0
gravatar for midox
15 months ago by
midox180
Tunisia
midox180 wrote:

hello, i need a good hash function for DNA sequences. i cant found in web. can you propose me a good and efficient hash function??

Thank you

dna hashing hash • 1.4k views
ADD COMMENTlink written 15 months ago by midox180
4

What for, what is your type of application? What's relevant for efficiency for you? Computational complexity, time, memory, cryptographic quality, expected number of collisions? What is your key size, input size? Specific programming languages?

ADD REPLYlink modified 15 months ago • written 15 months ago by Michael Dondrup42k
2

I want to store in a hash table a k-mers. I want a good hash function in time and memory. I do not know yet the number of colisions. I program in C. thanks.

ADD REPLYlink written 15 months ago by midox180
2

If you just want to store kmers in a table, don't hash them. Just encode them to a number A → 0, C → 1, … T → 3 and weight each digit accordingly.

ADD REPLYlink written 15 months ago by kloetzl630

no, i need to hash them because i want to compare sequences using the hash table. for this, i need a good hash function.

ADD REPLYlink written 15 months ago by midox180
3

The trivial hash is also a perfect one, it won't have collisions, so you can compare sequences. Also it is normally also cheapest to compute. If you only want to encode ACGT (no ambiguity), each base can be encoded as 2-bit. That means that 32 bits could address all 16 mers, and a 64-bit integer k = 32

ADD REPLYlink modified 15 months ago • written 15 months ago by Michael Dondrup42k

maybe i choose my k=8. can you explain me what's a trivial hash?? thanks

ADD REPLYlink written 15 months ago by midox180
3

Take for instance the kmer K=GTCGAATC (size 8)

If you take by convention A=0, C=1, G=2, T=3, you can encode K as a polynomial in base 4 like this:

K = G.4^0 + T.4^1 + C.4^2 + G.4^3 + A.4^4 + A.4^5 + T.4^6 + C.4^7
K = 2*4^0 + 3*4^1 + 1*4^2 + 2*4^3 + 0*4^4 + 0*4^5 + 3*4^6 + 1*4^7
K = 28830

You can thus encode any 8-mer as a unique integer value.

ADD REPLYlink modified 15 months ago • written 15 months ago by edrezen680

ok thank you for your explanations. if i have collisions, what do I have in this case? because in my sequences i have ~15% of errors. Thanks

ADD REPLYlink modified 15 months ago • written 15 months ago by midox180
2

With this technique the hash is bijective and thus no collisions occur for fixed k.

ADD REPLYlink written 15 months ago by kloetzl630

ok.if i have collisions? what do I have in this case? because error cause collisions!

ADD REPLYlink written 15 months ago by midox180
2

I think this is a misunderstanding. Sequencing errors do not cause hash collisions. A collision in the sense of the topic would be like h(s): S -> H , there exists s, v in S, s != v AND h(s) = h(v)

A seq. error causes a sequence to contain a 'wrong' k-mer (ACgT) while the 'true' sequence is (ACAT). You have no chance to differentiate the two by means of a hashing function.

ADD REPLYlink written 15 months ago by Michael Dondrup42k

have you some papers for hash that use nucleotides ??

ADD REPLYlink modified 15 months ago • written 15 months ago by midox180

Look for the khmer citation.

ADD REPLYlink written 15 months ago by Michael Dondrup42k

i see the citation but i not found!! have you another?? Thanks

ADD REPLYlink modified 15 months ago • written 15 months ago by midox180

what do u mean by 'i not found'? the paper is there, if u need more try the other thread or any of the papers in the references.

ADD REPLYlink written 15 months ago by Michael Dondrup42k
1

If you truly need a hash then I've found murmur3 to be quick and easy to use in C.

ADD REPLYlink written 15 months ago by Devon Ryan68k

i dont undestand the fonctionnality of Murmurhash. can you simplify me ?? THanks

ADD REPLYlink modified 15 months ago • written 15 months ago by midox180
1

It's a generic hash function, so you can support arbitrarily long kmers (or anything else, for that matter).

ADD REPLYlink written 15 months ago by Devon Ryan68k

ok thanks. do i need to undestand the algorithm? because in https://en.wikipedia.org/wiki/MurmurHash I did not understand the algorithm or the code! can you explain me if you can ?

ADD REPLYlink written 15 months ago by midox180
2

Don't bother trying to understand the hash function, just use it if you need it. This has good performance and is fairly widely used. I've used it in conjunction with the 2-bit encoding that Michael and the others have mentioned (i.e., encoding an A as 0, C as 1, and so on) and been happy with the results. Certainly if you can get away with just using the 2-bit encoded value then do it, you're unlikely to get better performance than that.

ADD REPLYlink written 15 months ago by Devon Ryan68k

OK thankyou. Just a few questions: what is the length of k-mers that can support MurMur ?? and can i use it on nucleotides ?? thanks for your responses.

ADD REPLYlink written 15 months ago by midox180
1

You can use any length of anything, it's a generic hashing function.

ADD REPLYlink written 15 months ago by Devon Ryan68k

in the focntion of MurMurHash there are (key, len, seed) the Key is the string , len is the lenght and what's seed? Thanks

ADD REPLYlink written 15 months ago by midox180
2

Seed is a random value that's used in the hashing. I just use uint32_t seed = 0xAAAAAAAA;.

ADD REPLYlink modified 15 months ago • written 15 months ago by Devon Ryan68k

if i use machine 64 what's the random value can i use?

ADD REPLYlink written 15 months ago by midox180
2

There's no difference between 32/64 bit machines with this, you can use the same value (e.g., 0xAAAAAAAA).

ADD REPLYlink written 15 months ago by Devon Ryan68k
1

On a 64bit machine, the most random value is 0x8000000000000000, plus or minus 0x8000000000000000. My personal fav is 0x0123456789ABCDEF, but dont tell anyone because every time some uses it, it becomes less random.

ADD REPLYlink modified 15 months ago • written 15 months ago by John11k
1

We need to add <sarcasm> tags to biostars.

ADD REPLYlink written 15 months ago by Devon Ryan68k

Hahaha, yes :) Or perhaps just a catch-all <bad advice> tag that I could encapsulate all my posts in :P

ADD REPLYlink written 15 months ago by John11k

it's logic to have a negative values by murmur?? because I have positive values and negative values !!!

ADD REPLYlink written 15 months ago by midox180
1

Presumably you need to recast to an unsigned integer. My code for this is:

uint64_t hashString(char *s) {
    int len = strlen(s);
    uint64_t hash_val[2];
    uint32_t seed = 0xAAAAAAAA;
#if UINTPTR_MAX == 0xffffffff
    MurmurHash3_x86_128((void *) s, len, seed, (void *) &hash_val);
#else
    MurmurHash3_x64_128((void *) s, len, seed, (void *) &hash_val);
#endif
    return hash_val[0];
}
ADD REPLYlink written 15 months ago by Devon Ryan68k

no i use machine 32 bit. i use the code in wikipedia website

ADD REPLYlink modified 15 months ago • written 15 months ago by midox180

Then use a uint32_t rather than an int32_t.

ADD REPLYlink written 15 months ago by Devon Ryan68k

yes in my function i used uint32_t but i have negative value. I think these are a false results???

ADD REPLYlink written 15 months ago by midox180
1

I guess you are printing you results wrong, because the u in uint32_t means unsigned i.e. no negative values.

ADD REPLYlink written 15 months ago by kloetzl630

That's not even a theoretical possibility. Presumably you're converting that to a signed value at some point.

ADD REPLYlink written 15 months ago by Devon Ryan68k

no, my value is like this in C code: uint32_t Value=0; Value=murmur('ACTGCAC',7,0xAAAAAAAA); printf("%d",Value); and here the result is negative value!!!

ADD REPLYlink modified 15 months ago • written 15 months ago by midox180
1

printf("%"PRIu32, Value);

You're converting an unsigned 32-bit integer to a signed int in what you're doing...

ADD REPLYlink modified 15 months ago • written 15 months ago by Devon Ryan68k
1

I wish I had your patience. ☺

ADD REPLYlink written 15 months ago by kloetzl630

what's printf("%"PRIu32, Value); it doesn't work in C!

ADD REPLYlink written 15 months ago by midox180

See here and here.

ADD REPLYlink written 15 months ago by kloetzl630

I'm somewhat amazed that you got things compiled without importing inttypes.h...

ADD REPLYlink written 15 months ago by Devon Ryan68k

no we juste use "%u" for unseigned. thanks

ADD REPLYlink written 15 months ago by midox180

Hello Devon.

Could you please share what the hash compression function do you use in order to assign and index for the hash value returned by murmur? What size of the hash table do you use?

Thanks in advance.

ADD REPLYlink written 4 months ago by germelcar20
1

So, fixed length input, and a hash table, I conclude you should optimize for lookup time/hash compute time, that would be the trivial hash algorithm, means you won't need any complex hashing function. Convert the k-mer into a 32-bit integer (if it fits) value and use the number to address an array...

ADD REPLYlink modified 15 months ago • written 15 months ago by Michael Dondrup42k
1

See MinHash

ADD REPLYlink written 15 months ago by 5heikki6.5k
1

See also: Is There A Fast Hashing Function For Nucleotide K-Mers (Q-Grams)?

ADD REPLYlink written 15 months ago by Michael Dondrup42k
1

That thread points out that if you want strand-unspecific hashing, make A/T 0 and C/G 1. I think thats pretty awesome :D Dump twice as much DNA in each hash.

ADD REPLYlink written 15 months ago by John11k

Are the generic functions (e.g., murmur hash) not sufficing for you?

ADD REPLYlink written 15 months ago by Devon Ryan68k

The up2bit format converts every sequence of DNA into a unique number, much like the 2bit format suggested by kloetzi, except the up2bit format costs 2 bits more, and allows you to store DNA of a variable length. It is stable over different struct sizes/types, (stuffing it into a signed or unsigned int wont matter, 'ATCG' has the same value in int16 as it does in int128, etc), and since its just bit shifting its faster than polynomial math.

However, its not a one-way hashing function since it can be reversed, so perhaps you were looking for another one of the many things Michael points.

code in python:

def dna_up2bit(DNA):
    up2bit = 1
    for twobit in [ ('A','C','T','G').index(char) for char in reversed(DNA) ]:
        up2bit = (up2bit << 2) + twobit
    return up2bit
ADD REPLYlink modified 15 months ago • written 15 months ago by John11k

Hello, I have a problem the hash function Murmur (). if I take k = 3 (lenght), the hash function normally gives values between 0 and 255. but I have values in millions and billions. Does this make sense?

ADD REPLYlink written 15 months ago by midox180
1

If you have more specific problems about implementation of this algorithm I would recommend to move over to stackexchange. However, this is a good hint as to why the use of a real hash is overkill, there are 64 unique 3-mers, why use the long integer to store them? The hash function is for arbitrary length input, so it doesn't take the k-mer length into account.

ADD REPLYlink written 15 months ago by Michael Dondrup42k

Sure that makes sense, the values should be distributed over the full range of possible integers.

Of course if everything is of length 3 then just do the 2-bit trick that everyone else mentioned. That'll be faster. It's when the length is too long or you have variable sized inputs that you need a generic hash function.

ADD REPLYlink modified 15 months ago • written 15 months ago by Devon Ryan68k
1

The up2bit code I pasted above does variable-length inputs :( I think the only time you need a generic hash-function when the OP asks for a generic hash function :P

ADD REPLYlink written 15 months ago by John11k

So what's the solution for DNA k-mer if a i have a small k. do i change the hash function???

ADD REPLYlink written 15 months ago by midox180
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 465 users visited in the last hour