Question: kmergenie to estimate genome size and then sequencing coverage
3.5 years ago, cook284 wrote:

Hi all. I'm in the process of uploading a draft bacterial genome assembly to NCBI. NCBI asks for a coverage estimate calculated as (number of bps sequenced / expected genome size) x % of bps placed in the final assembly. As this is a de novo project, I have used the kmergenie estimate as the expected genome size. The numbers are as follows:

Forward read fastq file: Num reads: 5261180, Num bases: 1575030702
Reverse read fastq file: Num reads: 5049184, Num bases: 1511690223
Total bps sequenced: 1575030702 + 1511690223 = 3086720925
kmergenie genome size estimate: 4727586
Actual assembly size: 4706279

This gave a coverage calculation of: (3086720925 / 4727586) x ((4706279 / 3086720925) x 100) = 99.549304867

I am inexperienced, but this seems like high coverage. Does this calculation seem sensible?
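A quick check of the expression above, using only the figures quoted in the post, suggests that the "bases sequenced" term cancels out, so the result is really the assembly size as a percentage of the kmergenie estimate rather than a fold coverage. The variable names here are illustrative only:

```python
# Figures quoted in the post above.
bases_sequenced = 1575030702 + 1511690223  # forward + reverse read bases
kmergenie_size = 4727586                   # kmergenie genome size estimate
assembly_size = 4706279                    # actual assembly size

# The expression exactly as written in the post.
as_posted = (bases_sequenced / kmergenie_size) * ((assembly_size / bases_sequenced) * 100)

# bases_sequenced appears once in a numerator and once in a denominator,
# so the expression reduces to assembly size over estimated genome size.
simplified = assembly_size / kmergenie_size * 100

print(f"as posted:  {as_posted:.4f}")
print(f"simplified: {simplified:.4f}")
```

Both lines print the same value, which is why the figure sits just under 100 regardless of how many bases were sequenced.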

coverage genome size kmergenie • 1.8k views
modified 3.5 years ago by Rayan Chikhi • written 3.5 years ago by cook284

For bacterial genomes, high-coverage sequencing is easily possible. Your genome size estimate (4.7 Mb) is also in line with the typical size of a bacterial genome. Is that number similar to a reference genome already in NCBI (or that of a closely related species)?

modified 3.5 years ago • written 3.5 years ago by genomax

Hi, yes, this genome size is similar to those of other closely related species on NCBI. It was more the method for calculating coverage that I was concerned about.

written 3.5 years ago by cook284

I don't understand your math. The coverage is (bases sequenced) / (genome size), which is 3086720925 / 4706279 ≈ 656. If you want the coverage of reads placed in the final assembly, you'll have to map them and then use the total number of mapped bases as the numerator instead (BBMap will print the coverage after mapping if you include the flag "covstats=covstats.txt").
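As a minimal sketch of the formula in this reply, the division can be written out as below. The function name `fold_coverage` is made up for illustration; the numbers are the ones quoted in the thread.

```python
def fold_coverage(bases_sequenced: int, genome_size: int) -> float:
    """Return fold coverage as total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return bases_sequenced / genome_size


# Total bases from both read files, divided by the actual assembly size.
raw_coverage = fold_coverage(3086720925, 4706279)
print(f"{raw_coverage:.2f}x")
```

Rounded to the nearest whole number, this gives the 656 quoted above.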

written 3.5 years ago by Brian Bushnell

Hi Brian, thank you. I misunderstood what was required as the % of bps placed in the final assembly (hence calculating it as the number of bps in the assembly as a % of the total number of bps sequenced). I have now mapped the reads as below:

$ bowtie2 -p 6 -f -x Str113_genomeidx -1 Strain113_S52_R1_001.fasta -2 Strain113_S52_R2_001.fasta --very-sensitive -X 1000 -I 200 | samtools view -bS - > Str113.bam

So I can use bedtools genomecov to estimate the coverage.
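For the bedtools step, one possible way to summarise the result is sketched below. It assumes per-base output in the three-column, tab-separated form that `bedtools genomecov -d` writes (sequence name, 1-based position, depth), and that the BAM file has been position-sorted first, which bedtools expects for BAM input. The function name `depth_summary` and the tiny example input are invented for illustration and are not real data from this project.

```python
import io
from typing import Iterable, Tuple


def depth_summary(lines: Iterable[str]) -> Tuple[float, float]:
    """Return (mean depth, fraction of positions with depth > 0)."""
    total_depth = 0
    positions = 0
    covered = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _name, _pos, depth = line.split("\t")
        d = int(depth)
        total_depth += d
        positions += 1
        if d > 0:
            covered += 1
    if positions == 0:
        raise ValueError("no depth records found")
    return total_depth / positions, covered / positions


# Made-up four-position example standing in for a real genomecov output file.
example = io.StringIO(
    "contig1\t1\t10\n"
    "contig1\t2\t12\n"
    "contig1\t3\t0\n"
    "contig2\t1\t18\n"
)
mean_depth, breadth = depth_summary(example)
print(mean_depth, breadth)
```

With a real file, the same function could be called on an open file handle instead of the `io.StringIO` object. The first value is the average depth, and the second is the fraction of positions covered by at least one read.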

written 3.5 years ago by cook284

Hi again Brian. This is what NCBI ask for:

"The estimated base coverage across the genome, eg 12x. This can be calculated by dividing the number of bases sequenced by the expected genome size and multiplying that by the percentage of bases that were placed in the final assembly. More simply it is the number of bases sequenced divided by the expected genome size."

I think this is depth, as opposed to coverage? Bowtie2 mapped 98.55% of the reads back to the assembly. For this NCBI calculation, would it be correct to do: (3086720925 / 4706279 = 656) x 0.99 = 649.44? This seems an extraordinarily large figure for coverage!
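The NCBI wording quoted above can be sketched as a small function. This is only an illustration: the name `ncbi_style_coverage` is hypothetical, and using the fraction of reads mapped by Bowtie2 (98.55%) as a stand-in for the "percentage of bases placed" is an assumption, since the two are not strictly the same quantity.

```python
def ncbi_style_coverage(bases_sequenced: int,
                        expected_genome_size: int,
                        fraction_placed: float) -> float:
    """Bases sequenced / expected genome size, scaled by the fraction placed."""
    if expected_genome_size <= 0:
        raise ValueError("expected_genome_size must be positive")
    if not 0.0 <= fraction_placed <= 1.0:
        raise ValueError("fraction_placed must be between 0 and 1")
    return bases_sequenced / expected_genome_size * fraction_placed


# Assumption: the 98.55% read mapping rate approximates the fraction of bases placed.
estimate = ncbi_style_coverage(3086720925, 4706279, 0.9855)
print(f"{estimate:.1f}x")
```

Using 0.9855 rather than the rounded 0.99 gives a slightly lower figure than 649.44, but either way the estimate stays in the region of 650x.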

written 3.5 years ago by cook284

Not really. Coverage can be anything. 600x for a bacterium is not unusual.

written 3.5 years ago by Brian Bushnell
3.5 years ago, Rayan Chikhi (France, Lille, CNRS) wrote:

Hi, the discussion looks correct to me: it seems that you have ~650x coverage. Kmergenie can be used to predict an assembly size, but if you have already performed an assembly, it is better to use the actual assembly size that you obtained. Note that for larger genomes, neither kmergenie nor your assembly provides the true genome size, because duplicated genome sequences are generally present in only one copy in the assembly. The assembly size is therefore generally smaller than the genome size. For bacterial genomes, however, assembly size and genome size are usually very close.
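The point about duplicated sequences can be illustrated with a toy calculation. Everything in this sketch is hypothetical: the sizes, the copy number, and the simplifying assumption that every copy of a repeat collapses into exactly one copy in the assembly.

```python
def true_genome_size(unique_bp: int, repeat_bp: int, copies: int) -> int:
    """Genome size when a repeat of repeat_bp bases occurs `copies` times."""
    return unique_bp + repeat_bp * copies


def collapsed_assembly_size(unique_bp: int, repeat_bp: int) -> int:
    """Assembly size if all copies of the repeat collapse into one."""
    return unique_bp + repeat_bp


# Hypothetical genome: 4 Mb of unique sequence plus a 5 kb repeat in 7 copies.
unique_bp, repeat_bp, copies = 4_000_000, 5_000, 7
genome = true_genome_size(unique_bp, repeat_bp, copies)
assembly = collapsed_assembly_size(unique_bp, repeat_bp)

# Hypothetical sequencing yield, used to compare the two coverage estimates.
bases_sequenced = 400_000_000
print(genome, assembly)
print(bases_sequenced / genome, bases_sequenced / assembly)
```

In this toy case the assembly is 30 kb shorter than the genome, so dividing by the assembly size slightly overstates the coverage. With few, short repeats the difference is small, which fits the remark that the two sizes are usually very close for bacteria.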

modified 3.5 years ago • written 3.5 years ago by Rayan Chikhi