I am having trouble estimating genome size with Jellyfish. We have Illumina reads for a shrimp and have run k-mer analyses with k-mer sizes from 17 up to 32. All the histograms show two peaks, and when we compare them, the second peak does not change with k-mer size. So we treated the second peak as the homozygous peak, took the peak height as the coverage, and calculated the genome size. But this grossly underestimates the genome size compared with the estimate from flow cytometry.
So now we are unsure whether we should omit the first peak completely. Please suggest an approach or formula that gives a reasonably accurate estimate.
For genome size estimation using k-mers, you have to consider all the distinct k-mers. So in your case you should also consider the heterozygous peak. The heterozygous peak contributes a large share, because it gives you exactly double the number of distinct k-mers at exactly half the coverage. I think you should add the heterozygous and homozygous peaks and then use the collective coverage in your genome size calculation. For instance, if your homozygous peak is at 100 and your heterozygous peak is at 50, your collective coverage will be 150. Also, this scenario fits well with diploid species.
Thanks for your reply. If I simply add both coverages and use the sum for the calculation, then my genome size will go down even further. Somewhere I read that we have to take the mean of both coverages. But what I need is a solid calculation, and I am not sure which one to follow.
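For reference, one commonly used manual calculation (not the one suggested in the reply above) divides the total number of k-mers, meaning the sum of multiplicity times count over the histogram with the low-multiplicity error k-mers excluded, by the depth of the homozygous peak. Below is a minimal Python sketch of that arithmetic, assuming the two-column multiplicity/count layout that `jellyfish histo` writes. The function names, the toy histogram, the peak depth of 50 and the error cutoff of 5 are all made up for illustration and are not values from this thread.

```python
def parse_histo(text):
    """Parse two-column histogram text: 'multiplicity count' on each line."""
    hist = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        hist[int(parts[0])] = int(parts[1])
    return hist


def estimate_genome_size(hist, homozygous_peak, error_cutoff):
    """Total k-mers (excluding likely errors) divided by homozygous peak depth.

    homozygous_peak: multiplicity at the homozygous peak (read off the plot).
    error_cutoff: multiplicities below this are treated as sequencing errors
                  (an assumption; choose it from the valley in your histogram).
    """
    total_kmers = sum(m * n for m, n in hist.items() if m >= error_cutoff)
    return total_kmers / homozygous_peak


# Hypothetical toy histogram: errors at 1x, a heterozygous peak at 25x,
# a homozygous peak at 50x, and a few repeats at 100x.
toy = """1 1000000
25 200
50 900
100 50
"""

size = estimate_genome_size(parse_histo(toy), homozygous_peak=50, error_cutoff=5)
print(size)  # 1100.0
```

In this toy case the heterozygous peak is kept in the sum rather than dropped, and each of its k-mers ends up counting for half, because its multiplicity is half the homozygous depth. A model-based fit handles overlapping peaks and the error tail more carefully than a hand-picked cutoff.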
Are you calculating it manually? Have you tried this online tool: http://qb.cshl.edu/genomescope/ ? It gives you a model-based estimate of the genome length, including the unique length.
Thanks a lot. I never knew about the online tool; we have been doing it manually so far. We have histograms for 17-mers up to 32-mers. We will run them through the online tool and see what we get.
Thanks again. kk