How do I compute the effective genome size?
4
2
9.1 years ago

Several pieces of software require this parameter.

Is counting the number of masked nucleotides in fasta files going to give a good approximate result?

If not, is there a simple way to do it?

ChIP-Seq • 9.7k views
3
9.1 years ago
thackl ★ 3.0k

EDIT: If you understand effective genome size as "mappable" genome size, then Devon is right, of course.

Assemblies will only give you a good size estimate if they are of really high quality. This is usually only the case for model organisms or small bacterial genomes.

Assemblies of larger genomes such as plants, animals, etc., and draft genomes in particular, usually do not contain a complete representation of a single haploid genome, which you would need to get your size estimate right. The reasons are that assembly algorithms usually cannot resolve all repeats and centromeric/telomeric regions, and are also prone to generating multiple sequences for different alleles of the same region.

In my opinion, there are two better approaches:

1) Use the experimentally determined nuclear DNA content (e.g. from www.genomesize.com) to calculate the haploid genome size. DNA content in pg can be directly converted into a bp estimate.

2) Use a k-mer-based approach to estimate the genome size from a high-coverage NGS data set of your organism.
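
To make approach 2) concrete, here is a minimal sketch of the usual back-of-the-envelope k-mer estimate: genome size ≈ total number of k-mers divided by the depth of the main peak in the k-mer histogram. The function name, the dict-based histogram format and the `min_depth` cutoff for discarding error k-mers are my own assumptions for illustration, not the output format of any particular k-mer counter. It also ignores heterozygosity and repeat peaks, so treat the result as a rough figure.

```python
def estimate_genome_size(histogram, min_depth):
    """Rough genome size from a k-mer histogram.

    histogram: dict {multiplicity: number of distinct k-mers seen that often}
    min_depth: hypothetical cutoff; k-mers seen fewer times are treated
               as sequencing errors and ignored.
    """
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # Depth of the main peak = multiplicity with the most distinct k-mers.
    peak_depth = max(kept, key=kept.get)
    # Total number of k-mer occurrences in the reads (errors excluded).
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth


# Toy histogram: error k-mers at depth 1-2, main peak at depth 10.
toy = {1: 1000, 2: 100, 9: 50, 10: 100, 11: 50}
size = estimate_genome_size(toy, min_depth=3)
```

In practice you would choose `min_depth` by looking at the valley between the error k-mers and the main peak in your own histogram.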

1

Note to self and others: Converting pg to bp is just multiplying by 0.978 * 10^9

http://www.genomesize.com/faq.php
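
As a small sketch of that conversion (the function and constant names are mine, and the 3.5 pg input is only an illustrative value, not a measurement I am vouching for):

```python
# Conversion factor from the note above: 1 pg of DNA ~ 0.978 * 10^9 bp.
BP_PER_PG = 0.978e9


def pg_to_bp(dna_content_pg):
    """Convert a DNA content in picograms to an approximate number of base pairs."""
    return dna_content_pg * BP_PER_PG


# Illustrative haploid DNA content of 3.5 pg.
approx_bp = pg_to_bp(3.5)
```

Make sure the value you look up is the haploid (1C) content. If it is given for the diploid nucleus, halve it first.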

0

Thanks. It is mostly for assembled genomes (if that is what assembly means). It will also mostly be used for model species.

3
9.1 years ago

The simplest method is to just subtract the number of Ns from the total length of the genome. That will overestimate things, but since the real number depends on read length, paired-end vs. single-end sequencing, and insert size, this is a simpler and quicker approximation.
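
A minimal sketch of that count, assuming a plain-text FASTA where gaps are written as N (or n). The function name is hypothetical, and it takes any iterable of lines so that an open file handle works too:

```python
def non_n_length(fasta_lines):
    """Total sequence length minus the number of N/n characters."""
    total = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += len(line) - line.upper().count("N")
    return total


# Tiny in-memory example; for a real genome pass an open file handle instead.
example = [">chr1", "ACGTNNNN", "nnAC", ""]
n_free = non_n_length(example)
```

Note that soft-masked (lowercase) repeats are still counted here. Only N characters are subtracted.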

0

I'll go this route then. It is probably good enough for SICER/MACS.

0

Somebody in my group pointed out that this is a very bad approximation: doing it for the human genome gives 95%, while the actual number is 74%.

1
9.1 years ago
Fidel ★ 2.0k

In programs like MACS, the effective genome size is used to compute statistics of mapped reads with respect to the size of the genome covered by reads. This size varies depending on read length and mapping strategy. By mapping strategy I just mean whether multi-mapping reads are kept or discarded. This can introduce a difference of about 20% in human and mouse effective genome sizes.

If multi-mapping reads (reads that map to multiple positions) are kept, then the strategy given by Devon can be used, because all positions in the genome can be covered by reads except for stretches of NNNs.

Otherwise, the best way to compute the effective genome size is to add up all positions covered by reads or, if you are using a model organism, you can use this table, although it is a bit outdated as they used reads of length 30.
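
Adding up the covered positions could be sketched like this. I am assuming the uniquely mapped reads have already been exported as (chromosome, start, end) intervals with 0-based, half-open coordinates as in BED. How you produce those intervals is outside this sketch, and the function name is hypothetical.

```python
from collections import defaultdict


def covered_bases(intervals):
    """Number of positions covered by at least one interval.

    intervals: iterable of (chrom, start, end), 0-based half-open.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current merged block.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current block and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


reads = [("chr1", 0, 10), ("chr1", 5, 15), ("chr1", 20, 30), ("chr2", 0, 5)]
covered = covered_bases(reads)
```

Overlapping reads are merged first so that each position is counted once, however many reads cover it.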

0
9.1 years ago

Wondering how to calculate the pg content of a given organism

0

It isn't computed but experimentally determined, and you can look it up at http://www.genomesize.com/

0

The most common approach is probably flow cytometry: you can use DNA-binding dyes such as propidium iodide and measure fluorescence per nucleus in a FACS machine.

Dolezel, 2007 and Veselska, 2014 describe some protocols I can recommend.
