Is there any good tutorial or book to help get some stats for k-mer analysis?
3.9 years ago
psschlogl ▴ 50

I am trying to do some k-mer analysis (counting, scores, and tests) on some genomes. I have been reading around, but much of the material is obscure and tedious, with far-fetched explanations or confusing formulas and derivations. I would like a reference that is light, concise, and clear about what is going on. I am not a mathematician or statistician, so I need something clear and direct.

Do you have any pointers?

I am currently using the following:

P(W) = P(W1 | W2...Wn-1) * P(Wn | W2...Wn-1) * P(W2...Wn-1)


The probability that an arbitrary n-mer is the word W is built from three components: the probability that the core (n-2) bases match, and the probabilities of the first and last bases given that the core matches.

E(C(W)) = C(W1...Wn-1) * C(W2...Wn) / C(W2...Wn-1)


E(C(W)) is the expected value for the count of the number of times W occurs in the genome, and C(Wi...Wj) is the actual count of the number of times the word Wi...Wj occurs.
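
A minimal sketch of this expected-count formula in Python, in case it helps make the question concrete. The function names `count_kmers` and `expected_count` are made up for illustration, counts are taken over overlapping windows on a single strand, and the word is assumed to be at least 3 bases long so that a core exists.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (single strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(seq, word):
    """E(C(W)) = C(W1...Wn-1) * C(W2...Wn) / C(W2...Wn-1)."""
    n = len(word)
    if n < 3:
        raise ValueError("word must have at least 3 bases")
    shorter = count_kmers(seq, n - 1)   # counts of (n-1)-mers
    cores = count_kmers(seq, n - 2)     # counts of (n-2)-mers
    prefix = shorter[word[:-1]]         # C(W1...Wn-1)
    suffix = shorter[word[1:]]          # C(W2...Wn)
    core = cores[word[1:-1]]            # C(W2...Wn-1)
    if core == 0:
        return 0.0                      # the core never occurs, so neither can W
    return prefix * suffix / core


print(expected_count("ACGTACGTAC", "ACG"))
```

For a real genome it would be cheaper to count the k-mers once and reuse the tables for every word, rather than recounting per word as this sketch does.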

Variance

Var(C(W)) = N* P(W) * (1-P(W)) = E(C(W)) * (1 - E(C(W))/N)


The standard deviation

sigma(W) = sqrt(E(C(W)) * (1 - E(C(W))/N))


And the z-score

Z(W) = (C(W) - E(C(W))) / sigma(W)


to detect under- or over-abundant k-mers.

I would like to learn about other scores and, more importantly, statistical tests for this analysis.
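
Putting the formulas together, a rough Python sketch of the z-score could look like this. The names `kmer_counts` and `z_score` are hypothetical, and N is assumed here to be the number of n-mer positions in the sequence (length - n + 1), which is my own reading since N is not defined above.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def z_score(seq, word):
    """Z(W) = (C(W) - E(C(W))) / sigma(W), or None when it is undefined."""
    n = len(word)
    if n < 3:
        raise ValueError("word must have at least 3 bases")
    total = len(seq) - n + 1                      # assumed meaning of N
    observed = kmer_counts(seq, n)[word]          # C(W)
    shorter = kmer_counts(seq, n - 1)
    core = kmer_counts(seq, n - 2)[word[1:-1]]    # C(W2...Wn-1)
    if core == 0:
        return None
    expected = shorter[word[:-1]] * shorter[word[1:]] / core
    variance = expected * (1 - expected / total)  # Var(C(W))
    if variance <= 0:
        return None                               # avoid dividing by zero
    return (observed - expected) / math.sqrt(variance)


print(z_score("ACGTACGTAC", "ACG"))
```

Strongly negative values would flag under-represented words and strongly positive values over-represented ones; any cutoff is a separate choice that this sketch does not make.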

I really appreciate any help.

Paulo

3.9 years ago
khorms ▴ 230

When you are looking for overrepresented k-mers in biological sequences, you are usually comparing one group of sequences to another. That means that the background distribution has to come from the control group of sequences. I am not sure how you are planning to incorporate such a background distribution into your framework here. There has been a lot of work published on the subject. I would recommend looking into information-theory-based methods such as FIRE (paper, website) because of their flexibility.

I was thinking of creating some random genomes with the same background base frequencies as the original genomes, using something like random.choices. Basically, that is what I have seen in some papers. Thank you for your time. I will check out your refs.
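
Roughly what I have in mind, as a sketch only. The function names `random_genomes` and `background_stats`, the number of random genomes, and the fixed seed are all arbitrary choices for illustration. Bases are drawn independently, so this background keeps the single-base frequencies but not any dinucleotide or higher-order structure.

```python
import random
import statistics
from collections import Counter


def random_genomes(seq, n_genomes=100, seed=0):
    """Draw random sequences with the same length and base frequencies as seq."""
    rng = random.Random(seed)
    freqs = Counter(seq)
    bases = sorted(freqs)
    weights = [freqs[b] for b in bases]
    return ["".join(rng.choices(bases, weights=weights, k=len(seq)))
            for _ in range(n_genomes)]


def count_word(seq, word):
    """Count overlapping occurrences of word (str.count skips overlaps)."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def background_stats(seq, word, n_genomes=100, seed=0):
    """Mean and standard deviation of the word count across random genomes."""
    counts = [count_word(g, word) for g in random_genomes(seq, n_genomes, seed)]
    return statistics.mean(counts), statistics.pstdev(counts)


genome = "ACGTACGGTACCGTTAGC" * 20
mean, sd = background_stats(genome, "ACG")
print(count_word(genome, "ACG"), mean, sd)
```

The observed count in the real genome can then be compared with the mean and spread of the random ones, instead of relying on the analytical variance.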