Is there a tool that calculates genome coverage? (e.g., 1X, 10X, 100X)
O.rka ▴ 720 · 22 months ago

I'm submitting a bunch of genomes to NCBI and they want genome coverage:

The estimated base coverage across the genome, eg 12x. This can be calculated by dividing the number of bases sequenced by the expected genome size and multiplying that by the percentage of bases that were placed in the final assembly. More simply it is the number of bases sequenced divided by the expected genome size.

Is there a tool that does this or should I just write a method myself?
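
For concreteness, the arithmetic in the quoted definition boils down to a couple of lines; this is only a sketch with made-up numbers, not something I have run against real data:

    # NCBI definition: (bases sequenced / expected genome size) * fraction of bases placed in the assembly
    bases_sequenced = 600_000_000      # total bases across all reads (hypothetical)
    expected_genome_size = 5_000_000   # expected genome size in bp (hypothetical)
    fraction_placed = 0.95             # fraction of sequenced bases placed in the final assembly (hypothetical)

    coverage = bases_sequenced / expected_genome_size * fraction_placed
    print(f"{coverage:.0f}x")          # 114x here; the "more simply" version just drops fraction_placed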

ncbi metagenomics submission

Mensur Dlakic ★ 27k · 22 months ago

There are many tools to calculate contig coverage. I recommend calculate_read_coverage.py from Autometa. Then you can get an approximate genome coverage by averaging coverage values from all contigs that belong to a given genome.

To get an exact genome coverage, divide the sum of (contig_length * contig_coverage) over all contigs that belong to a genome by the sum of those contig lengths.
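
In code, that length-weighted average would look roughly like this (a minimal sketch; the contig names, lengths and coverages are made up for illustration):

    # exact genome coverage as a length-weighted average of per-contig coverages (hypothetical values)
    contig_lengths  = {"contig_1": 52_000, "contig_2": 31_000, "contig_3": 8_000}
    contig_coverage = {"contig_1": 16.0,   "contig_2": 19.7,   "contig_3": 0.0}

    weighted_sum = sum(contig_lengths[c] * contig_coverage[c] for c in contig_lengths)
    genome_coverage = weighted_sum / sum(contig_lengths.values())
    print(f"{genome_coverage:.1f}x")   # 15.9x for these numbers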

O.rka ▴ 720

Are you suggesting something like this?

    import numpy as np
    # length-weighted sum of per-contig coverages, divided by the genome size
    aggregated_coverage = np.sum(contig_to_size[contigs] * contig_to_coverage[contigs])
    genome_to_coverage[id_genome] = aggregated_coverage / genome_size

Wouldn't this overestimate the coverage because it would imply that a read covers the entire contig? I just ran it for my genomes, and in some cases I'm getting 5623X, which seems like a lot.

Mensur Dlakic ★ 27k

In simplest terms, the coverage is all the bases sequenced (that belong to that genome) divided by genome size.

Your formula is correct. What you calculated is a high coverage, but not unheard of if you are sequencing pure cultures. The easiest way to see whether it makes sense is to look at the coverage of individual contigs: if they individually show similarly high values, then the coverage of the whole genome will be high as well.

Wouldn't this overestimate the coverage because it would imply that a read covers the entire contig?

The coverage is not a simple sum of all the reads that cover any part of the contig; you are right that this would overestimate everything. The script I mentioned takes care of calculating the average coverage per base properly; the results end up in a file called coverage.tab and look like this:

contig          coverage
k141_1056727    15.9906976744186
k141_568440     19.7450980392157
k141_117131     0
k141_241059     3.67483296213808
k141_513220     38.2857142857143
k141_10643      18.75
k141_145489     22.1052631578947
k141_258420     21.9610655737705
k141_656342     4.19158878504673
k141_36035      36.5017667844523
k141_272421     1.36474164133739
k141_330776     23.4240282685512
k141_627187     8.50357142857143
k141_886026     10.2403433476395
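
If you want to see what that per-base averaging amounts to without digging into Autometa, here is a rough equivalent using samtools depth (not the Autometa script itself; alignments.sorted.bam is a placeholder for a sorted BAM of your reads mapped to the assembly):

    import subprocess
    from collections import defaultdict

    # average per-base coverage for each contig, computed from `samtools depth -a`
    # (-a reports every position, including those with zero depth)
    depth_sum = defaultdict(int)
    n_positions = defaultdict(int)
    with subprocess.Popen(["samtools", "depth", "-a", "alignments.sorted.bam"],
                          stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            contig, _pos, depth = line.rstrip("\n").split("\t")
            depth_sum[contig] += int(depth)
            n_positions[contig] += 1

    for contig in depth_sum:
        print(contig, depth_sum[contig] / n_positions[contig], sep="\t")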

By the way, you may have to get the script from an older Autometa release, as it may have been discontinued in the most recent versions:

https://github.com/KwanLab/Autometa/releases/tag/1.0

O.rka ▴ 720

Oh, I think that is probably the culprit: I was using counts from featureCounts. How is contig coverage calculated in the example above?

Mensur Dlakic ★ 27k

It is all in the script I referenced above, and I am fairly confident you can find your way around the Python code.

Briefly, the raw reads are mapped to your assembly using bowtie2, followed by samtools and a couple of custom Perl scripts that come bundled with the version of Autometa I referenced above. If you already have a sorted .bam file from mapping the raw reads with some other program, you can bypass several steps; if you look through the code, it should be clear.
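
If it helps, the outline of those steps looks something like this (only a sketch; every file name and the index prefix are placeholders, and the per-contig averaging would then proceed as in the samtools depth example above):

    import subprocess

    # map raw reads to the assembly with bowtie2 and produce a sorted, indexed BAM
    subprocess.run(["bowtie2-build", "assembly.fasta", "assembly_index"], check=True)
    subprocess.run(["bowtie2", "-x", "assembly_index",
                    "-1", "reads_R1.fastq", "-2", "reads_R2.fastq",
                    "-S", "mapped.sam"], check=True)
    subprocess.run(["samtools", "sort", "-o", "mapped.sorted.bam", "mapped.sam"], check=True)
    subprocess.run(["samtools", "index", "mapped.sorted.bam"], check=True)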

O.rka ▴ 720

Thanks, will do. I appreciate your help.
