When searching for DNA motifs in upstream sequences (for instance with the MEME suite), it is recommended to supply a background model so that the motif can be distinguished from the background composition of the sequences. One option is a Markov model of order k, in which the probability of each base depends on the preceding k bases and is estimated from the frequencies of (k+1)-mers.
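
To make the idea concrete, here is a minimal sketch of how such a background could be estimated from a set of sequences by counting each base together with the k bases before it. The function name `markov_background` and the pseudocount are assumptions made for illustration; this is not the MEME suite's own implementation.

```python
from collections import defaultdict
from itertools import product


def markov_background(seqs, k, pseudocount=1.0, alphabet="ACGT"):
    """Estimate P(next base | previous k bases) from (k+1)-mer counts."""
    # counts[context][base] = how often `base` follows the k-mer `context`
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0))
    for seq in seqs:
        seq = seq.upper()
        for i in range(k, len(seq)):
            context, base = seq[i - k:i], seq[i]
            # Skip windows that contain ambiguous symbols such as N.
            if base in alphabet and all(c in alphabet for c in context):
                counts[context][base] += 1

    model = {}
    # Enumerate every possible context so unseen ones still get a row.
    for context in map("".join, product(alphabet, repeat=k)):
        row = counts[context]
        total = sum(row.values()) + pseudocount * len(alphabet)
        model[context] = {b: (row[b] + pseudocount) / total for b in alphabet}
    return model
```

Calling `markov_background(upstream_seqs, k=2)` on a hypothetical list of upstream sequences would return one probability row for each of the 16 dinucleotide contexts. The pseudocount keeps contexts that never occur in the data from getting zero probabilities.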
My question is: how should the value of k be chosen? The literature I found suggests that it should be proportional to the expected motif length, but gives no clear rule of thumb. My guess is that it should not be too large, both for computational reasons and to avoid overfitting.
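
The overfitting concern can be illustrated by counting parameters: an order-k model over the four bases has 4^k contexts, each with 3 free probabilities. A small sketch follows; the helper name `free_parameters` is hypothetical.

```python
def free_parameters(k, alphabet_size=4):
    """Number of independent probabilities in an order-k Markov model."""
    # Each of the alphabet_size**k contexts has (alphabet_size - 1)
    # free probabilities, because every row must sum to 1.
    return (alphabet_size - 1) * alphabet_size ** k


for k in range(6):
    print(k, free_parameters(k))
```

The count grows by a factor of 4 with each increment of k, so comparing it with the total number of bases available for estimating the background gives a rough sense of when the model starts to have more parameters than the data can support.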