Tool: Script to compute the effective genome size: epic-effective
4.5 years ago by
endrebak840 wrote:

A perennial question on bioinformatics sites is how to compute the effective genome size of a genome. epic now includes a script called epic-effective that does just this, and it can use multiple cores. So the next time someone asks about the effective genome size, you know where to point them.

endrebak@havpryd ~/c/epic> epic-effective -h
Compute the effective genome size from a fasta file.

(Visit for examples and help.)

    epic-effective --read-length=LEN [--nb-cpu=CPU] FILE
    epic-effective --help

    FILE                      Fasta genome
    -r LEN --read-length LEN  length of reads

    -h --help                 show this help message
    -n CPU --nb-cpu CPU       number of cores to use [default: 1]
endrebak@havpryd ~/c/epic> time epic-effective -r 35 -n 30 ~/genomes/hg19.fa
File analyzed:  /local/home/endrebak/genomes/hg19.fa
Genome length:  3095693983
Number unique 35-mers:  2529802735
Effective genome size:  0.8172005207531522
3250.78user 32.13system 4:46.33elapsed 1146%CPU (0avgtext+0avgdata 100815072maxresident)k
6186643inputs+0outputs (0major+162990minor)pagefaults 0swaps
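
The reported effective genome size is simply the ratio of the two counts printed above it. A quick check in Python, using only the numbers from this run (the variable names are mine):

```python
# Values copied from the epic-effective output for hg19 with 35 bp reads.
genome_length = 3095693983
unique_35mers = 2529802735

# Effective genome size as a fraction of the genome length.
effective_genome_size = unique_35mers / genome_length
print(round(effective_genome_size, 4))  # prints 0.8172
```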

To install the epic-package, just use pip install bioepic.

It uses jellyfish under the hood, so you need to install that too. If you use conda, conda install jellyfish works on linux64.
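
Since a genome-wide run takes a while, it can help to confirm that jellyfish is reachable before starting one. A minimal sketch, assuming epic-effective calls a command-line executable named jellyfish found on your PATH (the helper name tool_on_path is hypothetical, not part of epic):

```python
import shutil

def tool_on_path(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None

# Assumption: the jellyfish binary is invoked by the name "jellyfish".
if not tool_on_path("jellyfish"):
    print("jellyfish not found on PATH; install it before running epic-effective")
```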

Below is how I compute the effective genome size (EGS). If the formula is too simplistic, please tell me.

The effective genome size for a genome G and a read-length L is the number of unique L-mers in G divided by the length of G.

So for reads of length 2, the effective genome size of the genome CCCGNN is the following:

len(["CG"]) / len("CCCGNN")

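or 1/6. CG is the only 2-mer that counts: CC occurs twice, and GN and NN are left out because they contain N.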
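
The definition can be sketched in a few lines of Python. This is a toy illustration only, not the epic-effective implementation (which relies on jellyfish): it holds every L-mer in memory, ignores the reverse-complement strand, and treats lowercase bases as uppercase, all of which are my assumptions.

```python
from collections import Counter

def effective_genome_size(genome, read_length):
    """Number of L-mers occurring exactly once, divided by the genome length."""
    genome = genome.upper()  # assumption: soft-masked bases count like any other
    counts = Counter(
        genome[i:i + read_length]
        for i in range(len(genome) - read_length + 1)
    )
    # Keep L-mers seen exactly once; skip any that contain the unknown base N.
    unique = [kmer for kmer, n in counts.items() if n == 1 and "N" not in kmer]
    return len(unique) / len(genome)

# The example from the post: only "CG" qualifies, so the result is 1/6.
print(effective_genome_size("CCCGNN", 2))
```

For a real genome, this in-memory counting is far too slow and memory-hungry, which is why a dedicated k-mer counter is used instead.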

Edit: sorry about the bump. My link was wrong!

4.5 years ago by
endrebak840 wrote:

Here is the supplementary paper describing how the effective genome size and length were originally computed:

It seems they allow for some mismatches in the reads, which epic-effective does not do (it only counts sequences that are exactly unique). This would explain why epic-effective gets a higher EGS (about 10% higher for human genomes, iirc) than the classical methods.
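
To see why tolerating mismatches lowers the number, here is a toy sketch in Python. It is not the procedure from the paper, only my assumption of what "allowing mismatches" could mean: an L-mer stays unique only if no other L-mer in the genome is within max_mismatches substitutions of it. The function names are made up for the illustration, and the all-pairs comparison only works for tiny sequences.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def count_unique(genome, k, max_mismatches=0):
    """Count k-mers with no other k-mer within `max_mismatches` substitutions."""
    genome = genome.upper()
    mers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    mers = [m for m in mers if "N" not in m]  # skip k-mers with unknown bases
    unique = 0
    for i, mer in enumerate(mers):
        others = (m for j, m in enumerate(mers) if j != i)
        if all(hamming(mer, other) > max_mismatches for other in others):
            unique += 1
    return unique

toy = "ACGTACGA"
print(count_unique(toy, 4, max_mismatches=0) / len(toy))  # exact uniqueness
print(count_unique(toy, 4, max_mismatches=1) / len(toy))  # one mismatch tolerated
```

In this toy genome ACGT and ACGA differ by one base, so both drop out once a single mismatch is tolerated, and the ratio falls from 5/8 to 3/8.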
