Tool: Script to compute the effective genome size: epic-effective
8.0 years ago
endrebak ▴ 960

A perennial question on bioinformatics sites is how to compute the effective genome size for a genome. epic now includes a script called epic-effective to do just this. It can use multiple cores. So the next time someone asks about the effective genome size, you know where to point them.

endrebak@havpryd ~/c/epic> epic-effective -h
epic-effective
Compute the effective genome size from a fasta file.

(Visit github.com/endrebak/epic for examples and help.)

Usage:
    epic-effective --read-length=LEN [--nb-cpu=CPU] FILE
    epic-effective --help

Arguments:
    FILE                      Fasta genome
    -r LEN --read-length LEN  length of reads

Options:
    -h --help                 show this help message
    -n CPU --nb-cpu CPU       number of cores to use [default: 1]
endrebak@havpryd ~/c/epic> time epic-effective -r 35 -n 30 ~/genomes/hg19.fa
File analyzed:  /local/home/endrebak/genomes/hg19.fa
Genome length:  3095693983
Number unique 35-mers:  2529802735
Effective genome size:  0.8172005207531522
3250.78user 32.13system 4:46.33elapsed 1146%CPU (0avgtext+0avgdata 100815072maxresident)k
6186643inputs+0outputs (0major+162990minor)pagefaults 0swaps

To install the epic-package, just use pip install bioepic.

It uses jellyfish under the hood, so you need to install that too. If you use conda, conda install jellyfish works on linux64.

I have included how I compute the effective genome size (EGS) below. If the formula is too simplistic, please tell me.

The effective genome size for a genome G and a read-length L is the number of unique L-mers in G divided by the length of G.

So for reads of length 2, the effective genome size of the genome CCCGNN is the following:

len(["CG"]) / len("CCCGNN")

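or 1/6. CG is the only 2-mer that occurs exactly once and contains no N; CC occurs twice, so it is not unique.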
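To make the definition concrete, here is a minimal pure-Python sketch of the same calculation. It is not the code epic-effective runs (that uses jellyfish); the function name effective_genome_size is made up for illustration, and I am assuming that L-mers containing N are skipped and that only the forward strand is counted, which is what the CCCGNN example suggests.

```python
from collections import Counter


def effective_genome_size(genome: str, read_length: int) -> float:
    """Fraction of the genome covered by L-mers that occur exactly once.

    Illustrative only: assumes L-mers containing N are ignored and that
    only the forward strand is considered.
    """
    if not genome or read_length < 1:
        raise ValueError("need a non-empty genome and a positive read length")

    genome = genome.upper()
    # Count every L-mer at every start position in the sequence.
    counts = Counter(
        genome[i:i + read_length]
        for i in range(len(genome) - read_length + 1)
    )
    # Keep only L-mers seen exactly once that contain no ambiguous base.
    unique = sum(1 for kmer, n in counts.items() if n == 1 and "N" not in kmer)
    return unique / len(genome)


if __name__ == "__main__":
    # The example from the post: only "CG" is unique, so the result is 1/6.
    print(effective_genome_size("CCCGNN", 2))
```

Holding every L-mer in a Counter is fine for toy sequences but will not scale to something like hg19, which is why the real script hands the counting to jellyfish.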

Edit: sorry about the bump. My link was wrong!

ChIP-Seq • 3.7k views
8.0 years ago
endrebak ▴ 960

Here is the supplementary paper describing how the effective genome size and length were originally computed:

http://www.imagenix.com/uniqueome/Supplementary_Methods.pdf

It seems they allow for some mismatches in the reads, which epic-effective does not do (it only considers completely unique sequences). This would explain why epic-effective gets a higher EGS than the classical methods (by 10% for human genomes, iirc).
