A perennial question on bioinformatics sites is how to compute the effective genome size for a genome. epic now includes a script called epic-effective to do just this. It can use multiple cores. So the next time someone asks about the effective genome size, you know where to point them.
endrebak@havpryd ~/c/epic> epic-effective -h epic-effective Compute the effective genome size from a fasta file. (Visit github.com/endrebak/epic for examples and help.) Usage: epic-effective --read-length=LEN [--nb-cpu=CPU] FILE epic-effective --help Arguments: FILE Fasta genome -r LEN --read-length LEN length of reads Options: -h --help show this help message -n CPU --nb-cpu CPU number of cores to use [default: 1] endrebak@havpryd ~/c/epic> time epic-effective -r 35 -n 30 ~/genomes/hg19.fa File analyzed: /local/home/endrebak/genomes/hg19.fa Genome length: 3095693983 Number unique 35-mers: 2529802735 Effective genome size: 0.8172005207531522 3250.78user 32.13system 4:46.33elapsed 1146%CPU (0avgtext+0avgdata 100815072maxresident)k 6186643inputs+0outputs (0major+162990minor)pagefaults 0swaps
To install the epic-package, just use
pip install bioepic.
It is uses jellyfish under the hood so you need to install that too. If you use conda
conda install jellyfish works on linux64.
I have included how I compute the effective genome size below. If the formula is too simplistic, please tell me.
This is how I compute the EGS:
The effective genome size for a genome G and a read-length L is the number of unique L-mers in G divided by the length of G.
So for reads of length 2 the effective genome size of the geome CCCGNN is the following:
len(["CG"]) / len("CCCGNN")
Edit: sorry about the bump. My link was wrong!