Blog: Fast kmer reverse complement code using bit tricks
11
gravatar for Rayan Chikhi
6.2 years ago by
Rayan Chikhi1.4k
France, Lille, CNRS
Rayan Chikhi1.4k wrote:

Today, the main developer of the GATB library (edrezen) emailed us the following bit of code. 

u_int64_t revcomp64_v2 (const u_int64_t& x, size_t sizeKmer)
{
    u_int64_t res = x;

    res = ((res>> 2 & 0x3333333333333333) | (res & 0x3333333333333333) <<  2);
    res = ((res>> 4 & 0x0F0F0F0F0F0F0F0F) | (res & 0x0F0F0F0F0F0F0F0F) <<  4);
    res = ((res>> 8 & 0x00FF00FF00FF00FF) | (res & 0x00FF00FF00FF00FF) <<  8);
    res = ((res>>16 & 0x0000FFFF0000FFFF) | (res & 0x0000FFFF0000FFFF) << 16);
    res = ((res>>32 & 0x00000000FFFFFFFF) | (res & 0x00000000FFFFFFFF) << 32);
    res = res ^ 0xAAAAAAAAAAAAAAAA;

    return (res >> (2*( 32 - sizeKmer))) ;
}

It computes the reverse complement of a k-mer stored in 64 bits, very efficiently. I don't recall seeing bit tricks applied to revcomp like that :)

Technical details: each nucleotide is encoded in two bits (A=00b; C=01b; G=11b; T=10b). Why this encoding? because it's fast to convert from ascii: (ascii_nucleotide>>1)&3. (G. Rizk found that.)

This code will be included in the next version of the GATB-Core library.

ADD COMMENTlink modified 9 months ago by Zev.Kronenberg11k • written 6.2 years ago by Rayan Chikhi1.4k

that is pretty cool! 

ADD REPLYlink written 6.2 years ago by Istvan Albert ♦♦ 85k

This post is old, but it's worth mentioning that this function either has a bug, or based on my testing, is not complete. See my answer below.

ADD REPLYlink written 9 months ago by Zev.Kronenberg11k
4
gravatar for Rob
6.2 years ago by
Rob4.5k
United States
Rob4.5k wrote:

Cool!  Jellyfish uses a similar approach.  Specifically, check out mer_dna.hpp, where Guillaume has implemented a fairly generic version (allowing for different integer widths) of this idea!

ADD COMMENTlink written 6.2 years ago by Rob4.5k

oh indeed, almost identical! good find

ADD REPLYlink written 6.2 years ago by Rayan Chikhi1.4k

Yes, some pretty good code in that file ! And +1 for the static polymorphism of the mer_base class.

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by edrezen720
3
gravatar for Zev.Kronenberg
9 months ago by
United States
Zev.Kronenberg11k wrote:

This function is basically exactly the same as the OP, but it has a complement. In my testing the OP only reversed the bits.

This is my encoding {0:A, 1:C, 2:G, 3:T} :

  uint64_t ReverseComp64(const uint64_t mer, uint8_t kmerSize)
{
    uint64_t res = ~mer;

    res = ((res >> 2 & 0x3333333333333333) | (res & 0x3333333333333333) << 2);
    res = ((res >> 4 & 0x0F0F0F0F0F0F0F0F) | (res & 0x0F0F0F0F0F0F0F0F) << 4);
    res = ((res >> 8 & 0x00FF00FF00FF00FF) | (res & 0x00FF00FF00FF00FF) << 8);
    res = ((res >> 16 & 0x0000FFFF0000FFFF) | (res & 0x0000FFFF0000FFFF) << 16);
    res = ((res >> 32 & 0x00000000FFFFFFFF) | (res & 0x00000000FFFFFFFF) << 32);

    return (res >> (2 * (32 - kmerSize)));
}
ADD COMMENTlink written 9 months ago by Zev.Kronenberg11k
2
gravatar for martinsteinegger
22 months ago by
martinsteinegger30 wrote:

We compute the reverse complement k-mer in MMseqs2 using SIMD. The trick is to use two lookup tables (lookup1 and lookup2) for the reverse complements. The function should be roughly 1.3x times as fast as revcomp64_v2. It can probably still be optimized more.

// Compute reverse complement of k-mer in 2-bit-per-nucleotide encoding (A: 00, C: 01, T: 10, G: 11)
uint64_t revComplement(const uint64_t kmer, const int k) {
    // broadcast 64bit to 128 bit
    __m128i x = _mm_cvtsi64_si128(kmer);

    // create lookup (set 16 bytes in 128 bit)
    // a lookup entry at the index of two nucleotides (4 bit) describes the reverse
    // complement of these two nucleotides in the higher 4 bits (lookup1) or in the
    // lower 4 bits (lookup2)
    __m128i lookup1 = _mm_set_epi8(0x50,0x10,0xD0,0x90,0x40,0x00,0xC0,0x80,0x70,
                                   0x30,0xF0,0xB0,0x60,0x20,0xE0,0xA0);
    __m128i lookup2 = _mm_set_epi8(0x05,0x01,0x0D,0x09,0x04,0x00,0x0C,0x08,0x07,
                                   0x03,0x0F,0x0B,0x06,0x02,0x0E,0x0A);
    // set upper 8 bytes to 0 and revert order of lower 8 bytes
    __m128i upper = _mm_set_epi8(0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0,1,2,3,4,5,6,7);

    __m128i kmer1 = _mm_and_si128(x, _mm_set1_epi8(0x0F)); // get lower 4 bits
    __m128i kmer2 = _mm_and_si128(x, _mm_set1_epi8(0xF0)); // get higher 4 bits

    // shift right by 2 nucleotides
    kmer2 >>= 4;

    // use _mm_shuffle_epi8 to look up reverse complement
    kmer1 =_mm_shuffle_epi8(lookup1, kmer1);
    kmer2 = _mm_shuffle_epi8(lookup2, kmer2);

    // _mm_or_si128: bitwise OR
    x = _mm_or_si128(kmer1, kmer2);

    // set upper 8 bytes to 0 and revert order of lower 8 bytes
    x = _mm_shuffle_epi8(x, upper);

    // shift out the unused nucleotide positions (1 <= k <=32 )
    // broadcast 128 bit to 64 bit
    return (((uint64_t)_mm_cvtsi128_si64(x)) >> (uint64_t)(64-2*k));
}
ADD COMMENTlink written 22 months ago by martinsteinegger30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1688 users visited in the last hour