So I am trying to estimate the size of a genome and I keep getting confusing, inconsistent results. So far I have:
- assembly size. All assemblies tend towards a cumulative size of 600-700 Mbp, depending on the assembler and which sequence data is used, with the higher numbers coming from the assemblies considered better quality. Mapping back reads, DNA, fosmids, etc. shows that the assembly is essentially complete, the N50 is good, and there are no obvious problems.
- k-mer estimates on the inbred strain used in the assembly. Using the jellyfish recipe here: https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/, I get an estimate of 1 Gbp (see the first command sketch after this list).
- Edit: k-mer estimate on pooled wild strains. With a k-mer coverage of 55x, the estimated size is slightly above 1 Gbp, consistent with the inbred strain.
- mapping estimates. Calculating per-base coverage with samtools depth, integrating (summing coverage times count) and dividing by the modal coverage gives between 800 and 900 Mbp, with the larger libraries tending towards the lower end (see the second sketch after this list).
- lab methods (not sure which¹, but not based on sequencing) consistently report a genome size of 1.6 Gbp.
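For reference, the k-mer step looks roughly like this (a minimal sketch following the tutorial linked above; the file names, k=21, the hash size, and the low-coverage cutoff of 5 are illustrative placeholders, not my exact settings):

```bash
# Count canonical 21-mers across all read files (adjust -s/-t to the data)
jellyfish count -C -m 21 -s 5G -t 8 -o reads.jf reads_*.fastq

# Build the k-mer coverage histogram (two columns: coverage, count)
jellyfish histo -o reads.histo reads.jf

# Estimate genome size as total k-mers / peak k-mer coverage,
# skipping the low-coverage error tail (coverage < 5 here, as an example)
awk '$1 >= 5 {total += $1 * $2; if ($2 > best) {best = $2; peak = $1}}
     END {printf "peak=%d  estimated size=%.0f bp\n", peak, total / peak}' reads.histo
```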
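And the mapping-based estimate, for concreteness (again a sketch, not my exact commands; aln.bam is a placeholder, and the modal coverage is taken naively as the most frequent non-zero depth):

```bash
# Per-base depth over the assembly; -a includes zero-coverage positions
samtools depth -a aln.bam \
  | awk '{hist[$3]++}
     END {
       for (c in hist) {
         total += c * hist[c]                      # integrate coverage * count
         if (c + 0 > 0 && hist[c] > best) {best = hist[c]; mode = c}
       }
       printf "modal coverage=%d  estimated size=%.0f bp\n", mode, total / mode
     }'
```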
Questions: What am I doing wrong here? Is this kind of discrepancy to be expected (i.e., are all these methods close to worthless)? Are there other methods I can (easily) use to get more estimates?
¹ Edit: I checked. We have used staining densitometry and flow cytometry, both of which apparently give a size of 1.5-1.6 Gbp, using human and chicken as controls.