Question: k-mer counting into R using perfect hashing
4
Macherki M E120 wrote:

I worked in DNA K-mers counting and I prepare this formulation to solve the counting using perfect hash table :  Using Rcpp API(C++) to integrate the code into R:

`````` #include <Rcpp.h>
using namespace Rcpp;
/*
this code can be used with c++ by replacing IntegerVector by std::vector<int>
*/
//************************************************
inline const short int V (const char x){
switch(x){
case 'A':case 'a':
return 0;
break;
case 'C':case 'c':
return 1;
break;
case 'G':case 'g':
return 2;
break;
default:
return 3;
break;
}

}

inline unsigned int  X0( const std::string A,const int k ,const int n){
unsigned int  result=0;
int j=k;
for( int i=n-1;i>n-k-1;i--) {
result+= pow(4,k-j)*V(A[i]);
j--;
}
return result;
}

// [[Rcpp::export]]
inline IntegerVector kmer4(const std::string A,const int n,const int k)
{

IntegerVector P(pow(4,k));
int x=X0(A,k,n);
P[x]++;
const int N=pow(4,k-1);
for( int i=n-k-1;i>-1;i--){
x=N*V(A[i])+x/4-x%4/4;
P[x]++;
}
return P;
}
``````

There are two questions:

Assuming the index x of a kmer,the compliment of x is then (4^k)-x-1. Can we get the reverse using a numeric operation like preceding formula? There are two problems in running time: iteration over the string and vector creation where the k is over then 8. Are there ideas to solve these problems?

rcpp hashing c++ R k-mer • 2.0k views
modified 3.8 years ago • written 4.0 years ago by Macherki M E120
2

I'm not sure if you are doing this as a programming problem, or if you are more interested in the functionality of the software. If it is the latter, consider taking a look at the Bioconductor Biostrings package.

1

I'm also not sure about your goal. What are you trying to accomplish?

I am not very smart in bit transformation. I look to get the index of the reverse compliment of a k-mer parting from the original index.The compliment is solved but the reverse is more complicated. Second, I hope a collaboration to fast up the code overcoming principal bugs: string iteration and vector creation.

1

"Can you encode DNA, somehow, such that you can use a bitwise operator to reverse compliment it?"

?

1

Yes, how we can count kmers for the sequence and it 's reverse compliment at the same time?

5
John12k wrote:

I'll admit that I was a bit, um... chemically handicapped... last night when I tried to solve this puzzle. Don't drink and bitshift. So i've tidied it up a bit this morning and posted it as a more coherent answer. A refactor of shame.

There is no need for an operation to reverse compliment DNA that you've encoded - it is, by nature of being DNA, already reversed complimented. Complementation is built into the genetic code, and reversing a binary series is another way to say decode it backwards. Therefore, we can encode our DNA in any 2bit format we like, then by changing how we conceptualise what we have just written to memory, it is already reverse complimented.

``````>>> def twobit(DNA): return sum([('A','C','G','T').index(base)<<shift for shift,base in zip((0,2,4,6,8,10,12,14,16,18),DNA)])
>>> def bittwo(bits):  return ''.join([('A','C','G','T')[(bits >> x)&3] for x in (0,2,4,6,8,10,12,14,16,18) ])
>>> def revcomp(bits): return ''.join([('T','G','C','A')[(bits >> x)&3] for x in (18,16,14,12,10,8,6,4,2,0) ])
>>> twobit('AAACCGGGTT')
1026368
>>> bittwo(1026368)
'AAACCGGGTT'
>>> revcomp(1026368)
'AACCCGGTTT'
``````

Note that decoding to the original string, and decoding to the reverse compliment, takes exactly the same number of operations.

If you want to do a 2-to-1 mapping of kmers and their reverse compliment to a single binary value, thats a different problem entirely. The old trick is to find the lesser of the two encoded or decoded values and only write/print that. Many people do this on the encoding step to reduce their kmer table size by half, for example:

``````>>> min( twobit('AAACCGGGTT'), twobit('AACCCGGTTT') )
1026368 # only this gets written to the kmer table.
``````

However, it is but my opinion that such code suggests a design decision mistake. If you care about performance, this will hurt performance as it only takes 1 extra op to binary intersect a table twice as large, where it takes many ops to encode the string two ways and find out which is the smaller. On the other hand, if you care about memory, then there are much more efficient data structures than a kmer table, although of course they will be slower for many operations. This method of basically using computationally expensive lossy encoding is a poor compromise.

4
Macherki M E120 wrote:

I think that the problem is solved using this code using incrementation.

``````inline unsigned int  X1( const std::string A,const int k ){
unsigned int  result=V(A);
for( int i=1;i<k;i++) {
result+= pow(4,i)*V(A[i]);
}
return result;
}

// [[Rcpp::export]]
inline IntegerVector rc_kmer4(const std::string A,const int n,const int k)
{
const int np=k-1;
const int Lim=n-np;             //---- number of subsequence
const int beta=pow(4,k);
const int N=pow(4,np);
unsigned short int fo,ba;// forword and backword of substring
IntegerVector P(beta);
unsigned int x=X1(A,k);
unsigned int y=X0(A,k,k);
P[beta-x-1]++;
P[y]++;
for( int i=1;i<Lim;i++){
fo=V(A[i-1]);
ba=V(A[i+np]);
x=(x-fo)/4+ba*N;
y=4*y-beta*fo+ba;
P[beta-x-1]++;
P[y]++;
}
return P;
}
``````

It works with linear complexity. Think you for ideas @John