Today, the main developer of the GATB library (edrezen) emailed us the following bit of code.
u_int64_t revcomp64_v2 (const u_int64_t& x, size_t sizeKmer) { u_int64_t res = x; res = ((res>> 2 & 0x3333333333333333) | (res & 0x3333333333333333) << 2); res = ((res>> 4 & 0x0F0F0F0F0F0F0F0F) | (res & 0x0F0F0F0F0F0F0F0F) << 4); res = ((res>> 8 & 0x00FF00FF00FF00FF) | (res & 0x00FF00FF00FF00FF) << 8); res = ((res>>16 & 0x0000FFFF0000FFFF) | (res & 0x0000FFFF0000FFFF) << 16); res = ((res>>32 & 0x00000000FFFFFFFF) | (res & 0x00000000FFFFFFFF) << 32); res = res ^ 0xAAAAAAAAAAAAAAAA; return (res >> (2*( 32 - sizeKmer))) ; }
It computes the reverse complement of a k-mer stored in 64 bits, very efficiently. I don't recall seeing bit tricks applied to revcomp like that :)
Technical details: each nucleotide is encoded in two bits (A=00b; C=01b; G=11b; T=10b). Why this encoding? because it's fast to convert from ascii: (ascii_nucleotide>>1)&3. (G. Rizk found that.)
This code will be included in the next version of the GATB-Core library.
that is pretty cool!