Question: hash function for DNA sequences

0

Please log in to add an answer.

Use of this site constitutes acceptance of our User
Agreement
and Privacy
Policy.

Powered by Biostar
version 2.3.0

Traffic: 1026 users visited in the last hour

What for, what is your type of application? What's relevant for efficiency for you? Computational complexity, time, memory, cryptographic quality, expected number of collisions? What is your key size, input size? Specific programming languages?

44kI want to store in a hash table a k-mers. I want a good hash function in time and memory. I do not know yet the number of colisions. I program in C. thanks.

190If you just want to store kmers in a table, don't hash them. Just encode them to a number A → 0, C → 1, … T → 3 and weight each digit accordingly.

960no, i need to hash them because i want to compare sequences using the hash table. for this, i need a good hash function.

190The trivial hash is also a perfect one, it won't have collisions, so you can compare sequences. Also it is normally also cheapest to compute. If you only want to encode ACGT (no ambiguity), each base can be encoded as 2-bit. That means that 32 bits could address all 16 mers, and a 64-bit integer k = 32

44kmaybe i choose my k=8. can you explain me what's a trivial hash?? thanks

190Take for instance the kmer K=GTCGAATC (size 8)

If you take by convention A=0, C=1, G=2, T=3, you can encode K as a polynomial in base 4 like this:

You can thus encode any 8-mer as a unique integer value.

700ok thank you for your explanations. if i have collisions, what do I have in this case? because in my sequences i have ~15% of errors. Thanks

190With this technique the hash is bijective and thus no collisions occur for fixed k.

960ok.if i have collisions? what do I have in this case? because error cause collisions!

190I think this is a misunderstanding. Sequencing errors do not cause hash collisions. A collision in the sense of the topic would be like h(s): S -> H , there exists s, v in S, s != v AND h(s) = h(v)

A seq. error causes a sequence to contain a 'wrong' k-mer (ACgT) while the 'true' sequence is (ACAT). You have no chance to differentiate the two by means of a hashing function.

44khave you some papers for hash that use nucleotides ??

190Look for the khmer citation.

44ki see the citation but i not found!! have you another?? Thanks

190what do u mean by 'i not found'? the paper is there, if u need more try the other thread or any of the papers in the references.

44kIf you truly need a hash then I've found murmur3 to be quick and easy to use in C.

80ki dont undestand the fonctionnality of Murmurhash. can you simplify me ?? THanks

190It's a generic hash function, so you can support arbitrarily long kmers (or anything else, for that matter).

80kok thanks. do i need to undestand the algorithm? because in https://en.wikipedia.org/wiki/MurmurHash I did not understand the algorithm or the code! can you explain me if you can ?

190Don't bother trying to understand the hash function, just use it if you need it. This has good performance and is fairly widely used. I've used it in conjunction with the 2-bit encoding that Michael and the others have mentioned (i.e., encoding an A as 0, C as 1, and so on) and been happy with the results. Certainly if you can get away with just using the 2-bit encoded value then do it, you're unlikely to get better performance than that.

80kOK thankyou. Just a few questions: what is the length of k-mers that can support MurMur ?? and can i use it on nucleotides ?? thanks for your responses.

190You can use any length of anything, it's a generic hashing function.

80kin the focntion of MurMurHash there are (key, len, seed) the

Keyis the string ,lenis the lenght and? Thankswhat's seed190Seed is a random value that's used in the hashing. I just use

`uint32_t seed = 0xAAAAAAAA;`

.80kif i use machine 64 what's the random value can i use?

190There's no difference between 32/64 bit machines with this, you can use the same value (e.g., 0xAAAAAAAA).

80kOn a 64bit machine, the most random value is 0x8000000000000000, plus or minus 0x8000000000000000. My personal fav is 0x0123456789ABCDEF, but dont tell anyone because every time some uses it, it becomes less random.12kWe need to add

`<sarcasm>`

tags to biostars.80kHahaha, yes :) Or perhaps just a catch-all

`<bad advice>`

tag that I could encapsulate all my posts in :P12kit's logic to have a negative values by murmur?? because I have positive values and negative values !!!

190Presumably you need to recast to an unsigned integer. My code for this is:

80kno i use machine 32 bit. i use the code in wikipedia website

190Then use a

`uint32_t`

rather than an`int32_t`

.80kyes in my function i used

uint32_tbut i have negative value. I think these are a false results???190I guess you are printing you results wrong, because the

`u`

in`uint32_t`

means`unsigned`

i.e. no negative values.960That's not even a theoretical possibility. Presumably you're converting that to a signed value at some point.

80kno, my value is like this in C code:

uint32_t Value=0; Value=murmur('ACTGCAC',7,0xAAAAAAAA); printf("%d",Value);and here the result is negative value!!!190`printf("%"PRIu32, Value);`

You're converting an unsigned 32-bit integer to a signed

`int`

in what you're doing...80kI wish I had your patience. ☺

960what's

printf("%"PRIu32, Value);it doesn't work in C!190See here and here.

960I'm somewhat amazed that you got things compiled without importing

`inttypes.h`

...80kno we juste use "%u" for unseigned. thanks

190Hello Devon.

Could you please share what the hash compression function do you use in order to assign and index for the hash value returned by murmur? What size of the hash table do you use?

Thanks in advance.

20So, fixed length input, and a hash table, I conclude you should optimize for lookup time/hash compute time, that would be the trivial hash algorithm, means you won't need any complex hashing function. Convert the k-mer into a 32-bit integer (if it fits) value and use the number to address an array...

44kThe up2bit format converts every sequence of DNA into a unique number, much like the 2bit format suggested by kloetzi, except the up2bit format costs 2 bits more, and allows you to store DNA of a variable length. It is stable over different struct sizes/types, (stuffing it into a signed or unsigned int wont matter, 'ATCG' has the same value in int16 as it does in int128, etc), and since its just bit shifting its faster than polynomial math.

However, its not a one-way hashing function since it can be reversed, so perhaps you were looking for another one of the many things Michael points.

code in python:

12kSee MinHash

7.3kSee also: Is There A Fast Hashing Function For Nucleotide K-Mers (Q-Grams)?

44kThat thread points out that if you want strand-unspecific hashing, make A/T 0 and C/G 1. I think thats pretty awesome :D Dump twice as much DNA in each hash.

12kAre the generic functions (e.g., murmur hash) not sufficing for you?

80kHello, I have a problem the hash function Murmur (). if I take k = 3 (lenght), the hash function normally gives values between 0 and 255. but I have values in millions and billions. Does this make sense?

190If you have more specific problems about implementation of this algorithm I would recommend to move over to stackexchange. However, this is a good hint as to why the use of a real hash is overkill, there are 64 unique 3-mers, why use the long integer to store them? The hash function is for arbitrary length input, so it doesn't take the k-mer length into account.

44kSure that makes sense, the values should be distributed over the full range of possible integers.

Of course if everything is of length 3 then just do the 2-bit trick that everyone else mentioned. That'll be faster. It's when the length is too long or you have variable sized inputs that you need a generic hash function.

80kThe up2bit code I pasted above does variable-length inputs :( I think the only time you need a generic hash-function when the OP asks for a generic hash function :P

12kSo what's the solution for DNA k-mer if a i have a small k. do i change the hash function???

190