Question

kmergenie to estimate genome size and then sequencing coverage

7.9 years ago

cook284 • 0

Hi all. I'm in the process of uploading a draft bacterial genome assembly to NCBI. NCBI asks for a coverage estimate based on (number of bases sequenced / expected genome size) x (% of bases placed in the final assembly). As this is a de novo project, I calculated this using the kmergenie estimate for the expected genome size. The numbers are as follows:

Forward read fastq file: Num reads: 5261180, Num bases: 1575030702
Reverse read fastq file: Num reads: 5049184, Num bases: 1511690223
Total bases sequenced: 1575030702 + 1511690223 = 3086720925
kmergenie genome size estimate: 4727586
Actual assembly size: 4706279

This gave a coverage calculation of: (3086720925 / 4727586) x ((4706279 / 3086720925) x 100) = 99.549304867

I am inexperienced, but this seems like high coverage. Does this calculation seem sensible?

kmergenie genome size coverage • 3.2k views

ADD COMMENT • link updated 7.9 years ago by Rayan Chikhi ★ 1.5k • written 7.9 years ago by cook284 • 0

For bacterial genomes, high-coverage sequencing is easily achievable. Your genome size estimate (4.7 Mb) is also in line with typical bacterial genome sizes. Is that number similar to a reference genome already in NCBI (or that of a closely related species)?

ADD REPLY • link 7.9 years ago by GenoMax 141k

Hi, yes, this genome size is similar to that of other closely related species on NCBI. It was more the method for calculating coverage that I was concerned about.

ADD REPLY • link 7.9 years ago by cook284 • 0

I don't understand your math. The coverage is (bases sequenced) / (genome size), which is 3086720925 / 4706279 ≈ 656x. If you want the coverage of reads placed in the final assembly, you'll have to map them and then use the total number of mapped bases as the numerator instead (BBMap will print the coverage after mapping if you include the flag "covstats=covstats.txt").
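As a quick check of the arithmetic above, here is a minimal Python sketch. The read totals and assembly size are the figures quoted in the question; the function and variable names are made up for illustration.

```python
# Figures quoted in the original question.
forward_bases = 1_575_030_702
reverse_bases = 1_511_690_223
assembly_size = 4_706_279  # actual assembly size, in bases


def raw_coverage(total_bases: int, genome_size: int) -> float:
    """Coverage = bases sequenced / genome size (illustrative helper name)."""
    return total_bases / genome_size


total_bases = forward_bases + reverse_bases
coverage = raw_coverage(total_bases, assembly_size)
print(f"total bases sequenced: {total_bases}")
print(f"coverage: {coverage:.0f}x")
```

Swapping in the kmergenie estimate (4727586) as the genome size changes the result only slightly, because the two sizes differ by less than 1%.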

ADD REPLY • link 7.9 years ago by Brian Bushnell 20k

Hi Brian, thank you. I misunderstood what was required as the % of bases placed in the final assembly (hence calculating it as the number of bases in the assembly as a % of the total number of bases sequenced). I have now mapped the reads as below:

$ bowtie2 -p 6 -f -x Str113_genomeidx -1 Strain113_S52_R1_001.fasta -2 Strain113_S52_R2_001.fasta --very-sensitive -X 1000 -I 200 | samtools view -bS - > Str113.bam

So I can now use bedtools genomecov to estimate the coverage.
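If per-base depth is written out with `bedtools genomecov -d` (which reports tab-separated sequence name, position, and depth for each position, and expects a position-sorted BAM), the mean depth can be summarised with a short script. The sketch below is only illustrative: the `mean_depth` helper is a hypothetical name, and the three example records are made up rather than taken from this dataset.

```python
from typing import Iterable


def mean_depth(lines: Iterable[str]) -> float:
    """Average the depth column of per-base, tab-separated depth records."""
    total = 0
    positions = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _name, _pos, depth = line.split("\t")
        total += int(depth)
        positions += 1
    if positions == 0:
        raise ValueError("no depth records found")
    return total / positions


# Made-up records, only to show the expected three-column layout.
example = ["contig1\t1\t10", "contig1\t2\t20", "contig2\t1\t30"]
print(mean_depth(example))
```

For real output, the same function can be given an open file handle instead of the example list.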

ADD REPLY • link 7.9 years ago by cook284 • 0

Hi again Brian. This is what NCBI asks for: "The estimated base coverage across the genome, eg 12x. This can be calculated by dividing the number of bases sequenced by the expected genome size and multiplying that by the percentage of bases that were placed in the final assembly. More simply it is the number of bases sequenced divided by the expected genome size." I think this is depth, as opposed to coverage? 98.55% of the reads mapped back to the assembly with Bowtie2. For this NCBI calculation, would it be correct to do: (3086720925 / 4706279 = 656) x 0.99 = 649.44? This seems an extraordinarily large figure for coverage!
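The NCBI formula quoted above can be written out as a short Python sketch. It assumes that the fraction of reads mapped (98.55%) is an acceptable stand-in for the fraction of bases placed in the assembly, which is an approximation; the function name is hypothetical.

```python
total_bases = 3_086_720_925   # bases sequenced (both read files)
assembly_size = 4_706_279     # assembly size used as the expected genome size
fraction_placed = 0.9855      # assumption: mapped-read fraction approximates placed bases


def ncbi_style_coverage(bases: int, genome_size: int, placed: float) -> float:
    """(bases sequenced / expected genome size) x fraction of bases placed."""
    return bases / genome_size * placed


estimate = ncbi_style_coverage(total_bases, assembly_size, fraction_placed)
print(f"estimated coverage: {estimate:.1f}x")
```

Using 0.9855 rather than the rounded 0.99 gives a value a few units lower than 649.44, a difference that does not matter for an estimate reported as roughly 650x.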

ADD REPLY • link 7.9 years ago by cook284 • 0

Not really. Coverage can be anything. 600x for a bacterium is not unusual.

ADD REPLY • link 7.9 years ago by Brian Bushnell 20k

score 3 · Answer 1 · 2016-06-08

Hi, the discussion looks correct to me: it seems that you have ~650x coverage. Kmergenie can be used to predict an assembly size, but if you have already performed an assembly, it is better to use the actual assembly size that you obtained. Note that for larger genomes, neither kmergenie nor your assembly provides the true genome size, because duplicated genome sequences are generally present in only one copy in the assembly. Thus the assembly size is generally smaller than the genome size. For bacterial genomes, however, assembly size and genome size are usually very close.