Question

How do I compute the effective genome size?

2

9.2 years ago

Endre Bakken Stovner ▴ 970

Several pieces of software require this parameter.

Is counting the number of masked nucleotides in fasta files going to give a good approximate result?

If not, is there a simple way to do it?

ChIP-Seq • 9.8k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.2 years ago by Endre Bakken Stovner ▴ 970

3

9.2 years ago

Devon Ryan 104k

The simplest method is to just subtract the number of Ns from the total length of the genome. That will overestimate the effective size, but since the true value depends on read length, paired-end vs. single-end sequencing, and insert size, this is a simpler and quicker approximation.
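A minimal sketch of that subtraction, assuming the genome is available as FASTA-formatted lines. The function name `non_n_genome_size` and the toy sequence are made up for illustration. Note that soft-masked (lowercase) bases are still counted; only `N`/`n` is subtracted.

```python
from typing import Iterable, Tuple


def non_n_genome_size(fasta_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_bases, non_N_bases) for FASTA-formatted lines."""
    total = 0
    n_count = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        # Count hard-masked / unknown bases in either case.
        n_count += line.count("N") + line.count("n")
    return total, total - n_count


# Toy example (hypothetical sequence, not real data).
example = [">chr_toy", "ACGTNNNNAC", "GTacgtnnAC"]
total, effective = non_n_genome_size(example)
print(total, effective)  # 20 14
```

For a real genome, the same function can be fed an open file handle, since it only iterates over lines.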

ADD COMMENT • link 9.2 years ago by Devon Ryan 104k

0

I'll go this route then. It is probably good enough for SICER/MACS.

ADD REPLY • link 9.2 years ago by Endre Bakken Stovner ▴ 970

0

Somebody in my group pointed out that this is a very bad approximation: doing it for the human genome gives 95%, while the actual number is 74%.

ADD REPLY • link 9.2 years ago by Endre Bakken Stovner ▴ 970

1

9.2 years ago

Fidel ★ 2.0k

In programs like MACS, the effective genome size is used to compute statistics of mapped reads with respect to the size of the genome covered by reads. This size varies depending on read length and mapping strategy. By mapping strategy I just mean whether multi-mapping reads are kept or discarded. This choice can introduce a difference of about 20% in the human and mouse effective genome sizes.

If multi-mapping reads (reads that map to multiple positions) are kept, then the strategy given by Devon can be used, because all positions in the genome can be covered by reads except for stretches of Ns.

Otherwise, the best way to compute the effective genome size is to add up all positions covered by reads or, if you are using a model organism, you can use this table, although it is a bit outdated as they used reads of length 30.
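Adding up the covered positions could be sketched as follows, assuming the uniquely mapped reads have already been reduced to `(chromosome, start, end)` intervals in 0-based, half-open coordinates. The function name `covered_bases` and the toy reads are hypothetical; overlapping reads are merged so that each position is counted once.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 0-based half-open


def covered_bases(intervals: Iterable[Interval]) -> int:
    """Count genome positions covered by at least one interval."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    covered = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current merged span.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current span and start a new one.
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    return covered


# Toy reads (made-up coordinates).
reads = [("chr1", 0, 50), ("chr1", 30, 80), ("chr1", 200, 250), ("chr2", 10, 60)]
print(covered_bases(reads))  # 180
```

This only reflects positions that were actually covered in one data set, so the result depends on sequencing depth as well as on read length and mapping settings.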

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.2 years ago by Fidel ★ 2.0k

0

9.2 years ago

Antonio R. Franco ★ 5.1k

I am wondering how to calculate the pg content (DNA mass in picograms) of a given organism.

ADD COMMENT • link 9.2 years ago by Antonio R. Franco ★ 5.1k

0

It isn't computed but experimentally determined, and you can look it up at http://www.genomesize.com/

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.2 years ago by Endre Bakken Stovner ▴ 970

0

The most common approach is probably flow cytometry: you can use DNA-binding dyes, such as propidium iodide, and measure fluorescence per nucleus in a FACS machine.

Dolezel, 2007 and Veselska, 2014 describe some protocols I can recommend.

ADD REPLY • link 9.2 years ago by thackl ★ 3.0k

score 3 · Accepted Answer · 2015-09-11

EDIT: if you understand effective genome size as "mappable" genome size, then Devon is right, of course.

Assemblies will only provide you with a good size estimate if they are of really high quality. This is usually only the case for either model organisms or small bacterial genomes.

Assemblies of larger genomes such as plants, animals etc., and in particular draft genomes, usually do not contain a complete representation of a single haplogenome, which you would need to get your size estimate right. The reasons are that assembly algorithms usually cannot resolve all repeats and centromeric/telomeric regions, and are also prone to generating multiple sequences for different alleles of the same region.

In my opinion, there are two better approaches:

1) Use the experimentally determined nuclear DNA content (e.g. from www.genomesize.com) to calculate the haploid genome size. DNA content in pg can be directly converted into a bp estimate.

2) Use a k-mer based approach to estimate the genome size from a high-coverage NGS data set of your organism.
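Both approaches can be sketched in a few lines. This is only an illustration under stated assumptions: the pg conversion uses the commonly cited factor of 1 pg ≈ 0.978 × 10^9 bp and expects a haploid (1C) value, so a 2C measurement would have to be halved first. The k-mer estimate uses the simplest form, total k-mers divided by the depth of the main histogram peak, and ignores heterozygosity and repeat peaks. The function names, the `min_depth` cutoff, and the toy histogram are hypothetical.

```python
from typing import Dict

# Assumed conversion factor: 1 pg of DNA is roughly 0.978e9 base pairs.
BP_PER_PG = 0.978e9


def pg_to_bp(dna_content_pg: float) -> float:
    """Convert a haploid (1C) DNA content in picograms to base pairs."""
    return dna_content_pg * BP_PER_PG


def kmer_genome_size(histogram: Dict[int, int], min_depth: int = 5) -> float:
    """Estimate genome size from a k-mer histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and dropped.
    """
    solid = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    # Depth with the most distinct k-mers, taken as the haploid coverage peak.
    peak_depth = max(solid, key=solid.get)
    # Total number of k-mer occurrences in the retained part of the histogram.
    total_kmers = sum(depth * count for depth, count in solid.items())
    return total_kmers / peak_depth


# Toy histogram (made-up numbers): an error peak at low depth, main peak at 20.
toy_histogram = {1: 5000, 2: 800, 19: 900, 20: 1000, 21: 900}
print(kmer_genome_size(toy_histogram))  # 2800.0
print(pg_to_bp(1.0))  # 978000000.0
```

In practice the histogram would come from a k-mer counter run on the raw reads, and dedicated tools fit a model to the histogram rather than taking a single peak.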