I am trying to do some k-mer analysis (counting, scoring, and statistical testing) on a few genomes. I have been reading about it, but much of the material is hard to follow because of long-winded explanations or confusing formulas and derivations. I am looking for a reference that is light, concise, and clear about what is going on. I am not a mathematician or statistician, so I need something clear and direct.

Do you have any suggestions?

Here is what I am using so far:

```
P(W) = P(W1 | W2...Wn-1) * P(Wn | W2...Wn-1) * P(W2...Wn-1)
```

The probability that an arbitrary n-mer is the word W is built from three components: the probability that the core (the middle n-2 bases) matches, and the probabilities of the first and last bases given that the core matches.

```
E(C(W)) = C(W1...Wn-1) * C(W2...Wn) / C(W2...Wn-1)
```

E(C(W)) is the expected value for the count of the number of times W occurs in the genome, and C(Wi...Wj) is the actual count of the number of times the word Wi...Wj occurs.
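A minimal sketch of how this expected count could be computed in Python, assuming a single-strand sequence held in a plain string and overlapping k-mers. The helper names `count_kmers` and `expected_count` and the toy sequence are only for illustration:

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (one strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(word, seq):
    """E(C(W)) = C(W1...Wn-1) * C(W2...Wn) / C(W2...Wn-1)."""
    n = len(word)
    if n < 3:
        raise ValueError("word needs at least 3 bases to have a core")
    sub = count_kmers(seq, n - 1)    # counts of (n-1)-mers: prefix and suffix
    core = count_kmers(seq, n - 2)   # counts of (n-2)-mers: the core
    core_count = core[word[1:-1]]
    if core_count == 0:
        return 0.0                   # the core never occurs, so W cannot occur
    return sub[word[:-1]] * sub[word[1:]] / core_count


seq = "ACGTACGTACGA"                 # toy sequence, not real data
print(expected_count("ACG", seq))    # C(AC)=3, C(CG)=3, C(C)=3 -> 3.0
```

Recounting for every word is wasteful on a real genome; in practice the three count tables would be built once and reused.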

Variance

```
Var(C(W)) = N* P(W) * (1-P(W)) = E(C(W)) * (1 - E(C(W))/N)
```

The standard deviation

```
sigma(W) = sqrt(E(C(W)) * (1 - E(C(W))/N))
```

And the z-score

```
Z(W) = (C(W) - E(C(W))) / sigma(W)
```

which I use to detect under- and over-represented k-mers.
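Putting the formulas above together, a rough Python sketch could look like this. It assumes N is the number of n-mer positions in the sequence (my reading, since N is not defined above) and it only scores words that occur at least once; `z_scores` is a made-up helper name:

```python
from collections import Counter
from math import sqrt


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (one strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def z_scores(seq, n):
    """Z(W) = (C(W) - E(C(W))) / sigma(W) for every observed n-mer (n >= 3)."""
    obs = count_kmers(seq, n)
    sub = count_kmers(seq, n - 1)
    core = count_kmers(seq, n - 2)
    N = len(seq) - n + 1             # assumption: N = number of n-mer positions
    scores = {}
    for word, c in obs.items():
        # the core of an observed word always occurs, so no division by zero
        e = sub[word[:-1]] * sub[word[1:]] / core[word[1:-1]]
        var = e * (1 - e / N)
        if var <= 0:                 # degenerate case (e.g. homopolymer runs)
            continue
        scores[word] = (c - e) / sqrt(var)
    return scores


scores = z_scores("ACGTACGTACGA", 3)  # toy sequence, not real data
for word, z in sorted(scores.items(), key=lambda kv: kv[1]):
    print(word, round(z, 3))
```

Words that never occur (C(W) = 0) are skipped here, so strongly under-represented k-mers would need to be added by iterating over all possible n-mers instead of only the observed ones.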

I would also like to learn about other scores and, more importantly, statistical tests for this kind of analysis.

I really appreciate any help.

Thank you for your time.

Paulo

I was thinking of creating some random genomes with the same background base frequencies as the original genomes, using something like random.choices. That is basically what I have seen in some papers. Thank you for your time, and I will check out your refs.
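Something along these lines is what I have in mind, as a minimal sketch. It assumes each base is drawn independently with the original genome's single-base frequencies (so dinucleotide or higher-order composition is not preserved), and `random_genome` and `null_counts` are just names I made up:

```python
import random
from collections import Counter


def random_genome(seq, rng=None):
    """Random sequence with the same length and base frequencies as seq."""
    rng = rng or random.Random()
    freqs = Counter(seq)
    bases = sorted(freqs)
    weights = [freqs[b] for b in bases]   # raw counts work as relative weights
    return "".join(rng.choices(bases, weights=weights, k=len(seq)))


def null_counts(seq, word, n_sim=100, seed=0):
    """Overlapping counts of word in n_sim random genomes built from seq."""
    rng = random.Random(seed)             # seeded so runs are repeatable
    counts = []
    for _ in range(n_sim):
        sim = random_genome(seq, rng)
        hits = sum(sim.startswith(word, i) for i in range(len(sim) - len(word) + 1))
        counts.append(hits)
    return counts


genome = "ACGT" * 250                     # toy stand-in for a real genome
null = null_counts(genome, "ACG", n_sim=50)
print(min(null), max(null))
```

The observed count of a word in the real genome could then be compared with this simulated distribution to judge whether it is unusually high or low.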