Question: kmergenie to estimate genome size and then sequencing coverage
16 months ago, cook284 wrote:

Hi all. I'm in the process of uploading a draft bacterial genome assembly to NCBI. NCBI asks for a coverage estimate based on (number of bps sequenced / expected genome size) x (% of bps placed in the final assembly). As this is a de novo project, I calculated this using the kmergenie estimate as the expected genome size. The numbers are as follows:

Forward read fastq file: Num reads: 5261180, Num bases: 1575030702
Reverse read fastq file: Num reads: 5049184, Num bases: 1511690223
Total bps sequenced: 1575030702 + 1511690223 = 3086720925
kmergenie genome size estimate: 4727586
Actual assembly size: 4706279

This gave a coverage calculation of: (3086720925 / 4727586) x ((4706279 / 3086720925) x 100) = 99.549304867

I am inexperienced, but this seems like high coverage. Does this calculation seem sensible?

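A quick way to see why the calculation in the question comes out near 100 is to note that the total number of bases sequenced appears once as a numerator and once as a denominator, so it cancels. What remains is the assembly size as a percentage of the kmergenie estimate, not a coverage. The sketch below only re-does that arithmetic with the figures quoted in the question; the function names are made up for illustration.

```python
# Figures quoted in the question.
total_bases = 1575030702 + 1511690223   # forward + reverse reads
kmergenie_size = 4727586                # kmergenie genome size estimate
assembly_size = 4706279                 # actual assembly size


def original_formula(total, expected, assembled):
    """The calculation exactly as written in the question."""
    return (total / expected) * ((assembled / total) * 100)


def size_ratio_percent(expected, assembled):
    """What the formula reduces to once the total bases cancel."""
    return assembled / expected * 100


print(original_formula(total_bases, kmergenie_size, assembly_size))
print(size_ratio_percent(kmergenie_size, assembly_size))
```

Both lines print essentially the same value, which is why the result says nothing about how deeply the genome was sequenced.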

For bacterial genomes, high-coverage sequencing is easily possible. Your genome size estimate (4.7 Mb) is also in line with what a bacterial genome would be sized at. Is that number similar to a reference genome already in NCBI (or a closely related species)?

written 16 months ago by genomax

Hi, yes, this genome size is similar to that of other closely related species on NCBI. It was more the method of calculating coverage that I was concerned with.

written 15 months ago by cook284

I don't understand your math. The coverage is (bases sequenced)/(genome size), which is 3086720925/4706279 ≈ 656x. If you want the coverage of reads placed in the final assembly, you'll have to map them and then use the total number of mapped bases as the numerator instead (BBMap will print the coverage after mapping if you include the flag "covstats=covstats.txt").

written 16 months ago by Brian Bushnell

Hi Brian, thank you. I misunderstood what was required for the % of bps placed in the final assembly (hence calculating it as the number of bps in the assembly as a % of the total number of bps sequenced). I have now mapped the reads as below:

$ bowtie2 -p 6 -f -x Str113_genomeidx -1 Strain113_S52_R1_001.fasta -2 Strain113_S52_R2_001.fasta --very-sensitive -X 1000 -I 200 | samtools view -bS - > Str113.bam

So I can use bedtools genomecov to estimate the coverage.

written 15 months ago by cook284
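For the bedtools genomecov step mentioned above, a minimal sketch of turning per-base depth into a single mean figure might look like the following. It assumes the depth was written with `bedtools genomecov -d`, which, as far as I know, reports one tab-separated line per position (contig, position, depth) and expects a position-sorted BAM, so the BAM from the command above would likely need `samtools sort` first. The function name and the example lines are made up for illustration and are not data from this thread.

```python
import io


def mean_depth(lines):
    """Average the third column (depth) over every reported position."""
    total_depth = 0
    positions = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _contig, _pos, depth = line.split("\t")
        total_depth += int(depth)
        positions += 1
    # Avoid dividing by zero on empty input.
    return total_depth / positions if positions else 0.0


# Tiny made-up example in the assumed three-column format.
example = io.StringIO("contig1\t1\t10\ncontig1\t2\t20\ncontig2\t1\t30\n")
print(mean_depth(example))
```

In practice the same function could be fed an open file handle on the depth output instead of the in-memory example.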

Hi again Brian. This is what NCBI asks for:

"The estimated base coverage across the genome, eg 12x. This can be calculated by dividing the number of bases sequenced by the expected genome size and multiplying that by the percentage of bases that were placed in the final assembly. More simply it is the number of bases sequenced divided by the expected genome size."

I think this is depth, as opposed to coverage? 98.55% of the reads mapped back to the assembly with Bowtie2. For this NCBI calculation, would it be correct to do: (3086720925 / 4706279) x 0.99 = 656 x 0.99 = 649.44? This seems an extraordinarily large figure for coverage.

written 15 months ago by cook284
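The NCBI wording quoted above can be sketched as a small function, using only the numbers given in this thread. Two assumptions are worth flagging: the 98.55% of reads that mapped is used here as a stand-in for the percentage of bases placed (these can differ if read lengths vary), and the function and variable names are made up for illustration.

```python
def ncbi_coverage(bases_sequenced, expected_genome_size, fraction_placed=1.0):
    """Bases sequenced / expected genome size, scaled by the fraction placed."""
    return bases_sequenced / expected_genome_size * fraction_placed


total_bases = 3086720925   # forward + reverse bases from the question
assembly_size = 4706279    # actual assembly size
kmergenie_size = 4727586   # kmergenie genome size estimate
fraction_mapped = 0.9855   # assumption: mapped reads as a proxy for bases placed

for label, size in (("assembly", assembly_size), ("kmergenie", kmergenie_size)):
    raw = ncbi_coverage(total_bases, size)
    adjusted = ncbi_coverage(total_bases, size, fraction_mapped)
    print(f"{label}: {raw:.1f}x raw, {adjusted:.1f}x adjusted")
```

Whichever genome size is used as the denominator, the result stays in the region of 650x, so the choice between the kmergenie estimate and the assembly size barely matters here.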

Not really. Coverage can be anything. 600x for a bacterial genome is not unusual.

written 15 months ago by Brian Bushnell
15 months ago, Rayan Chikhi (France, Lille, CNRS) wrote:

Hi, the discussion looks correct to me: it seems that you have ~650x coverage. Kmergenie can be used to predict an assembly size, but if you have already performed an assembly, it is better to use the actual assembly size that you obtained. Note that for larger genomes, neither kmergenie nor your assembly provides the true genome size, because duplicated genome sequences are generally present in only one copy in the assembly. Thus the assembly size is generally smaller than the genome size. For bacterial genomes, however, assembly size and genome size are usually very close.
