Question: How do I compute the effective genome size?
2
3.6 years ago by
Norway
Endre Bakken Stovner880 wrote:

Several pieces of software require this parameter.

Is counting the number of masked nucleotides in FASTA files going to give a good approximate result?

If not, is there a simple way to do it?

chip-seq • 2.8k views
ADD COMMENTlink modified 3.3 years ago by Biostar ♦♦ 20 • written 3.6 years ago by Endre Bakken Stovner880
3
3.6 years ago by
thackl2.6k
MIT
thackl2.6k wrote:

EDIT: if you understand effective genome size as "mappable" genome size, then Devon is right, of course.

Assemblies will only provide you with a good size estimate if they are of really high quality. This is usually only the case for either model organisms or small, bacterial genomes.

Assemblies of larger genomes such as plants, animals etc., and in particular draft genomes, usually do not contain a complete representation of a single haplogenome, which you would need to get your size estimate right. The reasons are that assembly algorithms usually cannot resolve all repeats and centromeric/telomeric regions, and are also prone to generating multiple sequences for different alleles of the same region.

In my opinion, there are two better approaches:

1) Use the experimentally determined nuclear DNA content (e.g. from www.genomesize.com) to calculate the haploid genome size. DNA content in pg can be converted directly into a bp estimate.

2) Use a k-mer based approach to estimate the genome size from a high-coverage NGS data set of your organism.
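
To make approach 2 a bit more concrete, here is a minimal sketch of one common k-mer heuristic: divide the total number of k-mers by the k-mer depth at the main coverage peak. It assumes you already have a k-mer histogram (multiplicity mapped to the number of distinct k-mers seen that many times) from a k-mer counter. The function name, the valley-finding rule used to drop likely error k-mers, and the toy histogram are all illustrative assumptions, not the method of any particular tool.

```python
def estimate_genome_size(histogram):
    """Rough genome size estimate from a k-mer histogram.

    histogram: dict mapping k-mer multiplicity -> number of distinct k-mers
    observed with that multiplicity.
    """
    mults = sorted(histogram)

    # Low-multiplicity k-mers are mostly sequencing errors. Assume they end at
    # the first point where the histogram starts rising again (the "valley").
    valley = None
    for prev, cur in zip(mults, mults[1:]):
        if histogram[cur] > histogram[prev]:
            valley = prev
            break
    if valley is None:
        raise ValueError("no coverage peak found; coverage may be too low")

    # Keep only k-mers beyond the valley and locate the main coverage peak.
    solid = [m for m in mults if m > valley]
    peak = max(solid, key=lambda m: histogram[m])

    # Total k-mer occurrences divided by the peak depth approximates the
    # number of k-mer positions in the genome, i.e. roughly its size in bp.
    total_kmers = sum(m * histogram[m] for m in solid)
    return total_kmers / peak


if __name__ == "__main__":
    # Toy histogram: an error slope at multiplicity 1-3, a peak around 20.
    toy = {1: 1000, 2: 100, 3: 10, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50}
    print(estimate_genome_size(toy))
```

This ignores heterozygosity and repeat peaks, so treat the result as a ballpark figure only.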

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by thackl2.6k
1

Note to self and others: converting pg to bp is just multiplying by 0.978 * 10^9

http://www.genomesize.com/faq.php
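
As a small sketch of that conversion in Python: the factor 0.978 * 10^9 bp per pg is the one quoted above; the function names and the example value are made up for illustration. If the database reports a 2C value (the DNA content of a diploid nucleus), it has to be halved first to get a haploid size.

```python
# Conversion factor quoted in the thread: 1 pg of DNA ~ 0.978 * 10^9 bp.
BP_PER_PG = 0.978e9


def pg_to_bp(dna_content_pg):
    """Convert a DNA content in picograms to an approximate size in bp."""
    return dna_content_pg * BP_PER_PG


def haploid_size_from_2c(two_c_pg):
    """Hypothetical helper: halve a 2C value (diploid nucleus), then convert."""
    return pg_to_bp(two_c_pg / 2.0)


if __name__ == "__main__":
    c_value_pg = 3.5  # illustrative value only, not a real measurement
    print(f"{c_value_pg} pg ~ {pg_to_bp(c_value_pg):.3e} bp")
```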

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Endre Bakken Stovner880

Thanks. It is mostly for assembled genomes (if that is what assembly means). It will also mostly be used for model species.

ADD REPLYlink written 3.6 years ago by Endre Bakken Stovner880
3
3.6 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

The simplest method is to just subtract the number of Ns from the total length of the genome. That will overestimate things, but since the real number depends on read length, paired-end vs. single-end sequencing, and insert size, this is a simpler and quicker approximation.
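
A minimal sketch of that count in plain Python, assuming an uncompressed FASTA file read line by line. The function name is made up, and both upper- and lowercase N are treated as unknown bases; soft-masked (lowercase) a/c/g/t still count as sequence.

```python
def count_bases(fasta_lines):
    """Return (total_length, non_n_length) over all sequences in a FASTA.

    fasta_lines: any iterable of lines, e.g. an open file handle.
    """
    total = 0
    n_count = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and header lines such as ">chr1".
        if not line or line.startswith(">"):
            continue
        total += len(line)
        n_count += line.count("N") + line.count("n")
    return total, total - n_count


if __name__ == "__main__":
    example = [">chr1", "ACGTNNNN", "acgtnnAC", ">chr2", "NNNN"]
    total, non_n = count_bases(example)
    print(f"total={total} non-N={non_n} fraction={non_n / total:.2f}")
```

For a real genome, pass an open file handle instead of the example list, e.g. `count_bases(open("genome.fa"))` with your own file name.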

ADD COMMENTlink written 3.6 years ago by Devon Ryan89k

I'll go this route then. It is probably good enough for SICER/MACS.

ADD REPLYlink written 3.6 years ago by Endre Bakken Stovner880

Somebody in my group pointed out that this is a very bad approximation: doing it for the human genome gives 95%, while the actual number is 74%.

ADD REPLYlink written 3.6 years ago by Endre Bakken Stovner880
1
3.6 years ago by
Fidel1.9k
Germany
Fidel1.9k wrote:

In programs like MACS, the effective genome size is used to compute statistics of mapped reads with respect to the size of the genome covered by reads. This size varies depending on read length and mapping strategy. By mapping strategy I just mean whether multi-mapping reads are kept or discarded. This can introduce a difference of about 20% in human and mouse effective genome sizes.

If multi-mapping reads (reads that map to multiple positions) are kept, then the strategy given by Devon can be used, because all positions in the genome can be covered by reads except for stretches of Ns.

Otherwise, the best way to compute the effective genome size is to add up all positions covered by reads or, if you are using a model organism, you can use this table, although it is a bit outdated as they used reads of length 30.
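
"Adding up all positions covered by reads" can be sketched as merging overlapping read intervals per chromosome and summing the merged lengths. The sketch below assumes the alignments have already been reduced to (chrom, start, end) tuples with 0-based, half-open coordinates, as in BED; the function name and the input format are illustrative assumptions.

```python
from collections import defaultdict


def covered_bases(intervals):
    """Count positions covered by at least one interval.

    intervals: iterable of (chrom, start, end) tuples, 0-based half-open.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    covered = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current merged block.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current block and start a new one.
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    return covered


if __name__ == "__main__":
    reads = [("chr1", 0, 50), ("chr1", 25, 75), ("chr1", 100, 150), ("chr2", 10, 60)]
    print(covered_bases(reads))
```

For a whole-genome data set this holds every interval in memory, so in practice you would stream a coordinate-sorted file; the merging logic stays the same.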

 

 

ADD COMMENTlink written 3.6 years ago by Fidel1.9k
0
3.6 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

I am wondering how to calculate the pg content of a given organism.

ADD COMMENTlink written 3.6 years ago by Antonio R. Franco4.0k

It isn't computed but experimentally determined, and you can look it up at www.genomesize.com

ADD REPLYlink written 3.6 years ago by Endre Bakken Stovner880

The most common approach is probably flow cytometry: you can use DNA-binding dyes, such as propidium iodide, and measure fluorescence per nucleus in a FACS machine.

Dolezel, 2007 and Veselska, 2014 describe some protocols I can recommend.

ADD REPLYlink written 3.6 years ago by thackl2.6k