Genome size estimation discrepancy : Jellyfish k-mer Vs. estimate_genome_size.pl
2
0
Entering edit mode
5.2 years ago
Anand Rao ▴ 550

Hi folks,

I tried Genome Size Estimation using 2 resources:

1. Instructions at Kanazawa's K-mer and genome size estimation page

2. A Perl script from Joseph Ryan

For both of these, I used the same Jellyfish k-mer histogram file as input (input file), generated using the following syntax:

 jellyfish count -t 24 -C -m 25 -s 5G -o EthFoc-2_jellyfish --min-qual-char=5 EthFoc-2.*.txt_val*.fq # paired end file pair


and

jellyfish histo -o EthFoc-2_jellyfish.histo EthFoc-2_jellyfish


But the genome size and coverage estimates from each of these calculations are quite different.

Kanazawa's method results:

Genome size estimate - 56.3MB ; Coverage estimate - 50X? (also peak of histogram)

Ryan's method results:

estimate_genome_size.pl --kmer=25 --peak=50 --fastq=EthFoc-2.S282_L007.1.txt_val_1.fq EthFoc-2.S282_L007.2.txt_val_2.fq


Genome size estimate - 36.8MB ; Coverage estimate - ~60X

My guess is that Kanazawa's is a good 1st approximation, but the Perl script is more accurate? - Yes / No / Maybe?

I like how Kanazawa's demo helped me deduce repeat content is ~ 10% of genome, but seems like that also would just be a 1st approximation?

Has anyone encountered such discrepancy before, or know whether one is more reliable that the other? I am interested in deducing repeat content, but due to discrepancy, I'm not sure if my ~ 10% inference is even remotely accurate. Thoughts, anyone? Thank you!

k-mer jellyfish genome size • 3.2k views
4
Entering edit mode
5.2 years ago

BBMap's KmerCountExact does explicitly deal with multiple peaks to generate a genome-size estimate. However, that graph only has one prominent peak, so I would not expect the difference to be very significant. The primary peak is at 50x and there is a very minor secondary peak at 99x.

Manually, I calculated the size of the genome at approximately 54.8 Mbp, where around 50 Mbp is single-copy. That's specifically 50067742 at 1-copy + 1024732 at 2-copy + 724848 at 3-copy + 364704 at 4-copy + 2689529 at higher-copy (all are estimates).

3
Entering edit mode
5.2 years ago
Anand Rao ▴ 550

This reply is from Joseph Ryan - faculty at U-FL (http://ryanlab.whitney.ufl.edu/). With his permission to post here on his behalf. He is also author of the Perl script that my post refers to.

I suspect that the multiple peaks in your data are indicative of a highly repetitive genome. I have not tested my program on highly repetitive data and I suspect that estimate_genome_size.pl only gives an estimate of the single-copy proportion of the genome, which would mean that it is an underestimate in most eukaryotic genomes. I have tested my algorithm on a simulation of the published Hydra data. It gave an accurate prediction of the original assembly, but I'm guessing that this assembly lacked much of the repetitive content present in the actual genome. Nishiyama's algorithm looks quite similar to mine, and like mine, does not seem to deal explicitly with multiple peaks, so I'm not sure if either software is producing an accurate estimate. The best thing to do would be to simulate data, and see how the two algorithms perform. I would start with a genome or human chromosome and simulate using the procedure I outline here:

https://github.com/josephryan/estimate_genome_size.pl#if-you-dont-believe-it-works

I would also artificially introduce lots of repeats to the starting dataset, and then (1) see if the repeats introduce multiple peaks, (2) see how this affects the estimation of the two programs. I am very curious and would love to do this if I had grant money to maintain my software, but unfortunately, I don't and need to dedicate my time to fundable projects.

In the meantime, I have added a note of caution to users on my github repository.

Thanks for E-Mailing me. Feel free to post my reply on biostars.

1
Entering edit mode

Thanks for following up and posting this informative reply.