Question

How To Interpret The Formula For Estimating Genome Size From K-mers

10.3 years ago

Sabiha • 0

This is the formula used to estimate the genome size:

N = (M*L)/(L-K+1)

and

Genome_size = T/N,

where

N: depth, M: k-mer peak (the depth at which the k-mer frequency histogram peaks), K: k-mer size, L: average read length, T: total bases.
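To make the two formulas concrete, here is a minimal Python sketch that evaluates them. The function names and the input values are hypothetical, chosen only so the arithmetic is easy to follow; they are not taken from any real data set.

```python
def estimate_depth(kmer_peak, read_length, kmer_size):
    """Base-level depth: N = (M * L) / (L - K + 1)."""
    kmers_per_read = read_length - kmer_size + 1
    if kmers_per_read <= 0:
        raise ValueError("k-mer size must not exceed the read length")
    return kmer_peak * read_length / kmers_per_read


def estimate_genome_size(total_bases, kmer_peak, read_length, kmer_size):
    """Genome size: T / N."""
    return total_bases / estimate_depth(kmer_peak, read_length, kmer_size)


# Hypothetical inputs (assumptions for illustration only).
M, L, K, T = 20, 100, 21, 1_000_000_000

depth = estimate_depth(M, L, K)                  # 20 * 100 / 80
genome_size = estimate_genome_size(T, M, L, K)   # T / depth

print(f"depth N = {depth:.1f}")                  # depth N = 25.0
print(f"genome size = {genome_size:,.0f} bp")    # genome size = 40,000,000 bp
```

The term (L - K + 1) is the number of k-mers in a read of length L, so the first formula converts the k-mer peak depth into a per-base depth before T is divided by it.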

What is the total bases (T)?

Is it the total number of bases in the reads given to an assembler (e.g. ABySS, SOAP)?

How can I find out the total number of bases?

genome kmer • 3.9k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 10.3 years ago by Sabiha • 0

Hi Sabiha,

Have you ever come across a situation where different k values lead to greatly different genome size estimates? For example, with a read length of 300 bp, I tested my data with k=33 and got a single peak at 26, while k=121 gave a single peak at 7. The calculated N (depth) based on your formula is 29.1 and 11.7, respectively. Since T (total bases) is the same (it is the same data set), the genome size estimates differ greatly! How would you determine which one is the better estimate of the genome size?

Thanks.
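The depths quoted in this comment can be reproduced with a short Python sketch. It assumes the question's formula N = (M*L)/(L-K+1) and uses only the read length, k values, and peaks reported above; the function and variable names are hypothetical. Because T is identical for both runs, the ratio of the two genome size estimates is just the inverse ratio of the depths.

```python
def estimate_depth(kmer_peak, read_length, kmer_size):
    """Base-level depth: N = (M * L) / (L - K + 1)."""
    return kmer_peak * read_length / (read_length - kmer_size + 1)


READ_LENGTH = 300
# k-mer size -> k-mer peak, as reported in the comment above.
observations = {33: 26, 121: 7}

depths = {k: estimate_depth(peak, READ_LENGTH, k)
          for k, peak in observations.items()}

for k, n in depths.items():
    print(f"k={k}: N = {n:.1f}")   # k=33: N = 29.1, then k=121: N = 11.7

# Genome_size = T / N, so with the same T the size ratio is N(k=33) / N(k=121).
size_ratio = depths[33] / depths[121]
print(f"genome size (k=121) / genome size (k=33) = {size_ratio:.2f}")
```

With these numbers the k=121 estimate comes out roughly 2.5 times larger than the k=33 estimate.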

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 8.6 years ago by pbigbig ▴ 250

Ram · Answer 1 · 2015-02-19

9.2 years ago

edrezen ▴ 730

Total bases is the number of nucleotides in your data. For instance:

>seq1 len=15
ATCACACAGTTGTAC
>s2 len=21
ATAGATAGAATATGATAGATA

Here, T=15+21=36

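If your file is in FASTA format, you can use the following to get the number of bases (the tr step removes the newline characters, which wc -c would otherwise count as well):

grep -v ">" file.fasta | tr -d '\n' | wc -c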
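The same count can be sketched in Python, which may be handy if the number feeds straight into the genome size formula. This is a minimal sketch, not a full FASTA parser: the function name `count_bases` is hypothetical, and it assumes header lines start with ">" and that everything else is sequence.

```python
import io


def count_bases(lines):
    """Sum the lengths of all non-header lines, ignoring whitespace."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
    return total


# The two records from the example above, held in memory for the demo.
example = io.StringIO(
    ">seq1 len=15\n"
    "ATCACACAGTTGTAC\n"
    ">s2 len=21\n"
    "ATAGATAGAATATGATAGATA\n"
)
T = count_bases(example)
print(T)  # 36
```

For a real file, passing an open file handle works the same way, e.g. `with open("file.fasta") as fh: T = count_bases(fh)`.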

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by edrezen ▴ 730