Question: How to calculate overall coverage in de novo metagenomics assembly?
5.2 years ago by
SmallChess530 wrote:

Let's say I have a reference genome and I sequence it into short reads. Then I feed the reads to velvet to create a de novo assembly.

Let's say I get two or more contigs assembled (but not the entire genome). velvet also reports k-mer coverage for each contig.

For example, if AGCGGCC is my reference genome, my two assembled contigs are AG (the first two bases) and GCC (the last three bases). I'm also given k-mer coverage for AG and GCC, 10.0 and 20.0 respectively.

How do I find the overall coverage for the genome? For RNA-seq, we can calculate something like RPKM abundance for a transcript, but is there anything like that in metagenomics? Does my question even make sense? I know everything about my reference genome; can I report anything like coverage (or abundance) for the reference genome?


The Ray assembler reports a biological abundance statistic. Is this the coverage that I'm trying to find?

velvet metagenomics • 4.5k views
modified 5.2 years ago by Brian Bushnell17k • written 5.2 years ago by SmallChess530
5.2 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

Since you have known references, coverage for the reference has nothing to do with an assembly, or assemblers, or k-mers for that matter. Concatenate the references together and map all the reads to them to calculate coverage. For example, with BBMap: bbmap.sh ref=concatenated.fasta in=reads.fq covstats=covstats.txt scafstats=scafstats.txt
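To make the idea concrete, here is a minimal Python sketch of what "coverage per reference" means once reads have been mapped. It is not BBMap's implementation; the function name and the `(ref_name, start, end)` alignment tuples are hypothetical stand-ins for whatever a mapper reports, and the toy numbers are made up for illustration.

```python
def coverage_per_reference(ref_lengths, alignments):
    """Return {ref_name: (mean_depth, fraction_of_bases_covered)}.

    ref_lengths: {ref_name: length in bases}
    alignments:  iterable of (ref_name, start, end), 0-based, end-exclusive
                 (a hypothetical simplification of mapper output)
    """
    # One difference array per reference: +1 where a read starts, -1 where it ends.
    diff = {name: [0] * (length + 1) for name, length in ref_lengths.items()}
    for ref, start, end in alignments:
        diff[ref][start] += 1
        diff[ref][end] -= 1

    stats = {}
    for name, length in ref_lengths.items():
        running = total = covered = 0
        for delta in diff[name][:length]:
            running += delta          # depth at this base
            total += running          # sum of depths over all bases
            covered += running > 0    # count bases with at least one read
        stats[name] = (total / length, covered / length)
    return stats


# Toy example: two references in the concatenated set.
ref_lengths = {"genomeA": 10, "genomeB": 5}
alignments = [("genomeA", 0, 5), ("genomeA", 3, 8), ("genomeB", 0, 5)]
print(coverage_per_reference(ref_lengths, alignments))
```

Mean depth answers "how deeply was this genome sequenced", while the covered fraction answers "how much of it was seen at all"; for a known community both can be reported per reference genome.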
modified 10 months ago by RamRS28k • written 5.2 years ago by Brian Bushnell17k
5.2 years ago by
University of Nebraska
Josh Herr5.7k wrote:

Your question is confusing to me and is not very well communicated -- do you want to calculate coverage for a genome sequencing project or a metagenomic sequencing project?

Calculating coverage for a genome sequencing project is very straightforward -- there is plenty out there to help you figure it out.
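For the single-genome case, the usual back-of-the-envelope estimate is coverage = (number of reads × read length) / genome length. A minimal sketch, with made-up example numbers and a hypothetical function name:

```python
def expected_coverage(num_reads, read_length, genome_length):
    """Expected mean fold coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_length


# Illustrative numbers only: 1 million 100 bp reads on a 5 Mbp genome.
print(expected_coverage(1_000_000, 100, 5_000_000))
```

This assumes reads are sampled uniformly from one genome of known size, which is exactly the assumption that breaks down for a metagenome.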

Calculating coverage for a metagenome assembly is not straightforward.  First of all, you have no idea of the genome complexity and qualities of your "template" DNA.  You'll have many different strains, representing distinct OTUs, that provide overlapping coverage.  For more on this, see the earlier question "How to assess coverage of Ray metagenomic assemblies".  Even with mock communities, which barely approach the diversity of "real" metagenomic samples, you'll only be sequencing a small portion of your overall template -- in the best case, about 5 to 10% of metagenome sequencing reads will actually assemble.  You therefore have to understand all the caveats of metagenome assembly and coverage when communicating any numbers relating to your research.

What I do is simple and perhaps not the best solution (but I am not aware of any others -- and I've looked -- most people do this): map reads to your assembly with bwa or bowtie (not many will map, but you can see assembly "hot-spots") and communicate the caveats.

modified 5.2 years ago • written 5.2 years ago by Josh Herr5.7k

Thanks for the link! I'm a bit confused, which is why I'm asking. I've checked the Ray assembler and it reports something like biological abundances. Do you think this means coverage of a reference genome?

Please check:

modified 10 months ago by RamRS28k • written 5.2 years ago by SmallChess530

Still not sure if you're talking about genome or metagenome assembly -- this matters here, as they are not the same.

If you look at the link I posted from a previous question about Ray, you'll see it uses k-mers to measure coverage.  Don't confuse k-mer coverage with actual read coverage, as strain diversity and similar OTUs will affect this.
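As a rough illustration of why the two numbers differ even in the ideal case: the Velvet manual relates k-mer coverage Ck to base (read) coverage C by Ck = C × (L − k + 1) / L, where L is the read length and k the k-mer size. The sketch below simply inverts that relation; the function name and example values are hypothetical, and it assumes error-free reads of uniform length from a single genome, so it does not account for the strain-diversity effects mentioned above.

```python
def kmer_to_base_coverage(kmer_coverage, read_length, k):
    """Invert Ck = C * (L - k + 1) / L to estimate base coverage C."""
    kmers_per_read = read_length - k + 1
    if kmers_per_read <= 0:
        raise ValueError("k must not exceed the read length")
    return kmer_coverage * read_length / kmers_per_read


# Illustrative values: a contig with k-mer coverage 10.0, 100 bp reads, k = 31.
print(round(kmer_to_base_coverage(10.0, 100, 31), 2))
```

Because each read contributes fewer k-mers than bases, k-mer coverage is always lower than the corresponding base coverage for k > 1.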

Furthermore, you mention "reference genome" -- in a metagenomic sample how do you know what your reference genome is? 

written 5.2 years ago by Josh Herr5.7k

Sorry my question was unclear; I'm struggling with the subject (it's quite technical). I actually have a known microbial community that I can use to simulate reads. The goal is to evaluate how each de novo assembler, such as velvet, performs relative to the community the reads come from. I know I can get k-mer coverage for a contig easily, but I'm struggling to understand whether I can also calculate k-mer coverage or actual read coverage for an organism. I asked because I'm not even sure my question makes sense. Everywhere I look, people talk about k-mer coverage for a contig, but what about the reference genome? Would it be possible, or make sense, to calculate coverage for the genome?

modified 10 months ago by RamRS28k • written 5.2 years ago by SmallChess530

One more time: Is this a metagenome (unknown reference) or a synthetic microbial community (known reference genomes)?  This determines whether or not you can use k-mers to estimate coverage.

After your last comment here, I'm just confused about what exactly you are looking to do.  What is your research question?

written 5.2 years ago by Josh Herr5.7k