I recently had a very similar question about a motif I was working on, and I took two approaches, one analytical and one empirical.
For the analytical approach, I did almost what you suggested, but I also took the base composition of the scanned sequences into account. E.g., if 30% of the bases are A, the probability of seeing an A at a specific position is 0.3. The probability of a 7-mer like AATGCCA would then be:
(prob. of A) ^ 3 * (prob. of C) ^ 2 * (prob. of T) ^ 1 * (prob. of G) ^ 1
This probability times the number of scanned windows (10,000 - 6 = 9,994 for a 10 kb sequence) gives the number of occurrences that we would expect just by chance.
What I did not like about this approach is that it assumes the scanned windows are independent of each other, which they are not: each window overlaps its neighboring window by 6 bases.
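As a rough illustration of this calculation, here is a minimal Python sketch. The function name `expected_count` is my own, and it assumes that every position is drawn independently with the base frequencies measured in the scanned sequence itself:

```python
from collections import Counter


def expected_count(motif, sequence):
    """Expected number of chance occurrences of `motif` in `sequence`.

    Assumption: positions are independent and each base occurs with the
    frequency observed in `sequence`.
    """
    base_counts = Counter(sequence)
    total = len(sequence)

    # Probability of the motif at one fixed window: product of base frequencies.
    p_motif = 1.0
    for base in motif:
        p_motif *= base_counts[base] / total

    # Number of windows of length k in a sequence of length L is L - k + 1.
    n_windows = max(0, len(sequence) - len(motif) + 1)
    return p_motif * n_windows


if __name__ == "__main__":
    # Toy 10 kb sequence with 25% of each base (illustrative input only).
    seq = "ACGT" * 2500
    print(expected_count("AATGCCA", seq))
```

With equal base frequencies the 7-mer probability is 0.25^7, so the toy input gives 9,994 / 16,384 expected occurrences, i.e. fewer than one.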
For the empirical approach, I shuffled the nucleotides of the scanned sequence many times, each time scanning the randomly generated sequence and counting the occurrences of my motif. On average, this gave me exactly the result that I predicted with my first approach. I guess that the dependence between the windows is not of practical relevance for long sequences.
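The shuffling procedure could be sketched roughly like this in Python. The function names, the default of 1000 shuffles and the fixed seed are my own choices for illustration, and overlapping occurrences are counted, which matches scanning every window:

```python
import random


def count_occurrences(motif, sequence):
    """Count occurrences of `motif` in `sequence`, including overlapping ones."""
    k = len(motif)
    return sum(
        1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == motif
    )


def shuffled_counts(motif, sequence, n_shuffles=1000, seed=0):
    """Motif counts in `n_shuffles` random permutations of `sequence`.

    Shuffling preserves the base composition but destroys the order.
    The seed is fixed only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    bases = list(sequence)
    counts = []
    for _ in range(n_shuffles):
        rng.shuffle(bases)  # in-place permutation of the nucleotides
        counts.append(count_occurrences(motif, "".join(bases)))
    return counts


if __name__ == "__main__":
    seq = "ACGT" * 2500  # toy 10 kb input, illustrative only
    counts = shuffled_counts("AATGCCA", seq, n_shuffles=200)
    print(sum(counts) / len(counts))  # average count over the shuffles
```

The average over the shuffles is the number to compare with the analytical expectation.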
In addition, one could calculate something like a p-value for the observed occurrences, asking: What is the probability of finding the motif at least as often as we did just by chance? In terms of my first approach, one would use a Poisson distribution whose mean is the expected count, i.e. the calculated probability of the k-mer times the number of scanned windows n. In terms of the second approach, one would just count in how many of the shuffled sequences the motif was found at least as often as actually observed. For example: You found the motif twice in your scanned sequence. After shuffling the sequence 1000 times, in 30 cases you found the motif twice or more. This would give you a p-value of 30 / 1000 = 0.03.
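Both p-values can be sketched in a few lines of Python. The function names are hypothetical, and the Poisson version assumes the expected count (k-mer probability times number of windows) has already been computed:

```python
import math


def poisson_p_value(observed, expected):
    """P(X >= observed) for X ~ Poisson(mean = expected)."""
    # Sum the probabilities of 0 .. observed - 1 and take the complement.
    below = sum(
        math.exp(-expected) * expected ** i / math.factorial(i)
        for i in range(observed)
    )
    return max(0.0, 1.0 - below)


def empirical_p_value(observed, null_counts):
    """Fraction of shuffled sequences with at least `observed` occurrences."""
    hits = sum(1 for c in null_counts if c >= observed)
    return hits / len(null_counts)


if __name__ == "__main__":
    # The example from the text: 30 of 1000 shuffles reach two or more hits.
    null_counts = [2] * 30 + [0] * 970
    print(empirical_p_value(2, null_counts))

    # Analytical counterpart for a made-up expected count of 0.25.
    print(poisson_p_value(2, 0.25))
```

Note that the empirical p-value can never be smaller than 1 / (number of shuffles), so a very rare motif needs more shuffles to resolve.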