Question

Tool:Script to compute the effective genome size: epic-effective

12

Entering edit mode

8.0 years ago

endrebak ▴ 960

A perennial question on bioinformatics sites is how to compute the effective genome size for a genome. epic now includes a script called epic-effective to do just this. It can use multiple cores. So the next time someone asks about the effective genome size, you know where to point them.

endrebak@havpryd ~/c/epic> epic-effective -h
epic-effective
Compute the effective genome size from a fasta file.

(Visit github.com/endrebak/epic for examples and help.)

Usage:
    epic-effective --read-length=LEN [--nb-cpu=CPU] FILE
    epic-effective --help

Arguments:
    FILE                      Fasta genome
    -r LEN --read-length LEN  length of reads

Options:
    -h --help                 show this help message
    -n CPU --nb-cpu CPU       number of cores to use [default: 1]
endrebak@havpryd ~/c/epic> time epic-effective -r 35 -n 30 ~/genomes/hg19.fa
File analyzed:  /local/home/endrebak/genomes/hg19.fa
Genome length:  3095693983
Number unique 35-mers:  2529802735
Effective genome size:  0.8172005207531522
3250.78user 32.13system 4:46.33elapsed 1146%CPU (0avgtext+0avgdata 100815072maxresident)k
6186643inputs+0outputs (0major+162990minor)pagefaults 0swaps

To install the epic-package, just use pip install bioepic.

It is uses jellyfish under the hood so you need to install that too. If you use conda conda install jellyfish works on linux64.

I have included how I compute the effective genome size below. If the formula is too simplistic, please tell me.

This is how I compute the EGS:

The effective genome size for a genome G and a read-length L is the number of unique L-mers in G divided by the length of G.

So for reads of length 2 the effective genome size of the geome CCCGNN is the following:

len(["CG"]) / len("CCCGNN")

or 1/6.

Edit: sorry about the bump. My link was wrong!

ChIP-Seq • 3.7k views

ADD COMMENT • link updated 9 months ago by Ram 43k • written 8.0 years ago by endrebak ▴ 960

score 1 · Answer 1 · 2016-04-26

Here is the supplementary paper describing the way computing the effective genome size and length was originally done:

http://www.imagenix.com/uniqueome/Supplementary_Methods.pdf

It seems like they allow for some mismatches in the reads, which epic-effective does not do (it only considers completely unique sequences). This would explain why epic-effective gets a higher egs (10 % for human genomes iirc) than the classical methods.