Today, the main developer of the GATB library (edrezen) emailed us the following bit of code.
u_int64_t revcomp64_v2 (const u_int64_t& x, size_t sizeKmer)
{
u_int64_t res = x;
res = ((res>> 2 & 0x3333333333333333) | (res & 0x3333333333333333) << 2);
res = ((res>> 4 & 0x0F0F0F0F0F0F0F0F) | (res & 0x0F0F0F0F0F0F0F0F) << 4);
res = ((res>> 8 & 0x00FF00FF00FF00FF) | (res & 0x00FF00FF00FF00FF) << 8);
res = ((res>>16 & 0x0000FFFF0000FFFF) | (res & 0x0000FFFF0000FFFF) << 16);
res = ((res>>32 & 0x00000000FFFFFFFF) | (res & 0x00000000FFFFFFFF) << 32);
res = res ^ 0xAAAAAAAAAAAAAAAA;
return (res >> (2*( 32 - sizeKmer))) ;
}
It computes the reverse complement of a k-mer stored in 64 bits, very efficiently. I don't recall seeing bit tricks applied to revcomp like that :)
Technical details: each nucleotide is encoded in two bits (A=00b; C=01b; G=11b; T=10b). Why this encoding? because it's fast to convert from ascii: (ascii_nucleotide>>1)&3. (G. Rizk found that.)
This code will be included in the next version of the GATB-Core library.
This post is old, but it's worth mentioning that this function either has a bug, or based on my testing, is not complete. See my answer below.
that is pretty cool!