Several pieces of software require this parameter.
Is counting the number of masked nucleotides in fasta files going to give a good approximate result?
If not, is there a simple way to do it?
EDIT: if you understand effective genome size as "mappable" genome size, then Devon is right, of course.
Assemblies will only provide you with a good size estimate if they are of really high quality. This is usually only the case for model organisms or for small bacterial genomes.
Assemblies of larger genomes such as those of plants and animals, and draft genomes in particular, usually do not contain a complete representation of a single haploid genome, which is what you would need to get your size estimate right. The reasons are that assembly algorithms usually cannot resolve all repeats and centromeric/telomeric regions, and are also prone to generating multiple sequences for different alleles of the same region.
In my opinion, there are two better approaches:
1) Use the experimentally determined nuclear DNA content (e.g. from www.genomesize.com) to calculate the haploid genome size. DNA content in pg can be converted directly into a bp estimate.
2) Use a k-mer based approach to estimate the genome size from a high-coverage NGS data set of your organism.
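To illustrate the k-mer idea in option 2, here is a minimal sketch of the common back-of-the-envelope estimate: total number of k-mers divided by the depth of the main coverage peak. It assumes you already have a k-mer histogram (multiplicity mapped to number of distinct k-mers) from some k-mer counter. The function name and the `min_depth` cutoff used to skip low-multiplicity error k-mers are my own assumptions, and dedicated tools that fit a model to the histogram will be more robust, especially for heterozygous or repeat-rich genomes.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 5) -> float:
    """Rough genome size estimate from a k-mer histogram.

    histogram maps k-mer multiplicity -> number of distinct k-mers seen
    that many times. min_depth is a hypothetical cutoff: k-mers below it
    are treated as sequencing errors and ignored.
    """
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth of the main peak: the multiplicity shared by the most distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total number of k-mer occurrences, excluding the presumed errors.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return total_kmers / peak_depth


# Toy histogram, made-up numbers for illustration only.
toy = {1: 1000, 2: 200, 10: 50, 20: 100, 30: 40}
print(estimate_genome_size(toy))  # 185.0
```

The cutoff matters: set it too low and error k-mers inflate the estimate, so it is worth looking at the histogram before picking a value.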
The simplest method is to just subtract the number of Ns from the total length of the genome. That will overestimate things, but since the real number depends on read length, paired-end vs. single-end sequencing, and insert size, this is a simpler and quicker approximation.
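A minimal sketch of that subtraction, assuming a plain FASTA file in which gaps are written as `N` or `n`. The function name is made up for this example, and it counts only Ns, not soft-masked (lowercase) bases.

```python
from typing import Iterable, Tuple


def count_non_n_bases(fasta_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_bases, non_n_bases) for FASTA-formatted lines."""
    total = 0
    n_count = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and sequence headers.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        n_count += line.count("N") + line.count("n")
    return total, total - n_count


# Small in-memory example; for a real genome pass an open file handle instead,
# e.g. open("genome.fa") where "genome.fa" is a placeholder path.
example = [">chr1", "ACGTNNNN", "nnacgt", ">chr2", "NNNN"]
print(count_non_n_bases(example))  # (18, 8)
```

The second value is the quick approximation of the effective genome size described above.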
In programs like MACS, the effective genome size is used to compute statistics of mapped reads with respect to the size of the genome covered by reads. This size varies depending on read length and mapping strategy. By mapping strategy I mean whether multi-mapping reads are kept or discarded. This can introduce a difference of about 20% in human and mouse effective genome sizes.
If multi-mapping reads (reads that map to multiple positions) are kept, then the strategy given by Devon can be used, because all positions in the genome can be covered by reads except for stretches of Ns.
Otherwise, the best way to compute the effective genome size is to add up all positions covered by reads or, if you are using a model organism, you can use this table, although it is a bit outdated as it is based on reads of length 30.
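Adding up all covered positions can be sketched as follows, assuming the read alignments have already been exported as `(chrom, start, end)` tuples with 0-based, half-open coordinates (BED-style). The function name and the input format are assumptions for this example; for real data a dedicated coverage tool will be much faster.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def covered_bases(intervals: Iterable[Tuple[str, int, int]]) -> int:
    """Count positions covered by at least one interval (half-open coordinates)."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current merged block.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current block and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Made-up alignments for illustration.
reads = [("chr1", 0, 50), ("chr1", 30, 80), ("chr1", 100, 150), ("chr2", 10, 60)]
print(covered_bases(reads))  # 180
```

Each position is counted once no matter how many reads overlap it, which is what matters when multi-mapping reads are discarded.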
I am wondering how to calculate the pg content of a given organism.
It isn't computed but experimentally determined, and you can look it up at http://www.genomesize.com/
The most common approach is probably flow cytometry: you can use DNA-binding dyes such as propidium iodide and measure fluorescence per nucleus in a FACS machine.
Dolezel, 2007 and Veselska, 2014 describe some protocols I can recommend.
Note to self and others: converting pg to bp is just multiplying by 0.978 * 10^9 (1 pg = 978 Mbp).
http://www.genomesize.com/faq.php
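The conversion from the note above as a tiny sketch. The function name is made up, and the 3.5 pg input is just an illustrative value. It assumes the value you looked up is a haploid C-value; if your measurement is a 2C value, halve it first.

```python
BP_PER_PG = 0.978e9  # 1 pg of DNA corresponds to about 978 Mbp


def pg_to_bp(c_value_pg: float) -> float:
    """Convert a haploid DNA content in picograms to base pairs."""
    return c_value_pg * BP_PER_PG


print(f"{pg_to_bp(3.5):,.0f} bp")  # 3,423,000,000 bp
```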
Thanks. It is mostly for assembled genomes (if that is what assembly means). It will also mostly be used for model species.