Tool: Script to compute the effective genome size: epic-effective
10
gravatar for endrebak
4.5 years ago by
endrebak840
github.com/endrebak
endrebak840 wrote:

A perennial question on bioinformatics sites is how to compute the effective genome size for a genome. epic now includes a script called epic-effective to do just this. It can use multiple cores. So the next time someone asks about the effective genome size, you know where to point them.

endrebak@havpryd ~/c/epic> epic-effective -h
epic-effective
Compute the effective genome size from a fasta file.

(Visit github.com/endrebak/epic for examples and help.)

Usage:
    epic-effective --read-length=LEN [--nb-cpu=CPU] FILE
    epic-effective --help

Arguments:
    FILE                      Fasta genome
    -r LEN --read-length LEN  length of reads

Options:
    -h --help                 show this help message
    -n CPU --nb-cpu CPU       number of cores to use [default: 1]
endrebak@havpryd ~/c/epic> time epic-effective -r 35 -n 30 ~/genomes/hg19.fa
File analyzed:  /local/home/endrebak/genomes/hg19.fa
Genome length:  3095693983
Number unique 35-mers:  2529802735
Effective genome size:  0.8172005207531522
3250.78user 32.13system 4:46.33elapsed 1146%CPU (0avgtext+0avgdata 100815072maxresident)k
6186643inputs+0outputs (0major+162990minor)pagefaults 0swaps

To install the epic-package, just use pip install bioepic.

It is uses jellyfish under the hood so you need to install that too. If you use conda conda install jellyfish works on linux64.

I have included how I compute the effective genome size below. If the formula is too simplistic, please tell me.


This is how I compute the EGS:

The effective genome size for a genome G and a read-length L is the number of unique L-mers in G divided by the length of G.

So for reads of length 2 the effective genome size of the geome CCCGNN is the following:

len(["CG"]) / len("CCCGNN")

or 1/6.

Edit: sorry about the bump. My link was wrong!

tool chip-seq • 2.5k views
ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by endrebak840
1
gravatar for endrebak
4.5 years ago by
endrebak840
github.com/endrebak
endrebak840 wrote:

Here is the supplementary paper describing the way computing the effective genome size and length was originally done:

http://www.imagenix.com/uniqueome/Supplementary_Methods.pdf

It seems like they allow for some mismatches in the reads, which epic-effective does not do (it only considers completely unique sequences). This would explain why epic-effective gets a higher egs (10 % for human genomes iirc) than the classical methods.

ADD COMMENTlink written 4.5 years ago by endrebak840
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2259 users visited in the last hour