Tool:Script to compute the effective genome size: epic-effective
Entering edit mode
7.7 years ago
endrebak ▴ 930

A perennial question on bioinformatics sites is how to compute the effective genome size for a genome. epic now includes a script called epic-effective to do just this. It can use multiple cores. So the next time someone asks about the effective genome size, you know where to point them.

endrebak@havpryd ~/c/epic> epic-effective -h
Compute the effective genome size from a fasta file.

(Visit for examples and help.)

    epic-effective --read-length=LEN [--nb-cpu=CPU] FILE
    epic-effective --help

    FILE                      Fasta genome
    -r LEN --read-length LEN  length of reads

    -h --help                 show this help message
    -n CPU --nb-cpu CPU       number of cores to use [default: 1]
endrebak@havpryd ~/c/epic> time epic-effective -r 35 -n 30 ~/genomes/hg19.fa
File analyzed:  /local/home/endrebak/genomes/hg19.fa
Genome length:  3095693983
Number unique 35-mers:  2529802735
Effective genome size:  0.8172005207531522
3250.78user 32.13system 4:46.33elapsed 1146%CPU (0avgtext+0avgdata 100815072maxresident)k
6186643inputs+0outputs (0major+162990minor)pagefaults 0swaps

To install the epic-package, just use pip install bioepic.

It is uses jellyfish under the hood so you need to install that too. If you use conda conda install jellyfish works on linux64.

I have included how I compute the effective genome size below. If the formula is too simplistic, please tell me.

This is how I compute the EGS:

The effective genome size for a genome G and a read-length L is the number of unique L-mers in G divided by the length of G.

So for reads of length 2 the effective genome size of the geome CCCGNN is the following:

len(["CG"]) / len("CCCGNN")

or 1/6.

Edit: sorry about the bump. My link was wrong!

ChIP-Seq • 3.6k views
Entering edit mode
7.6 years ago
endrebak ▴ 930

Here is the supplementary paper describing the way computing the effective genome size and length was originally done:

It seems like they allow for some mismatches in the reads, which epic-effective does not do (it only considers completely unique sequences). This would explain why epic-effective gets a higher egs (10 % for human genomes iirc) than the classical methods.


Login before adding your answer.

Traffic: 1417 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6