How to assess coverage of Ray metagenomic assemblies
2
1
Entering edit mode
6.7 years ago
nfellaby ▴ 30

Having been really pleased with results from the Ray assembler I was a little disappointed to see that there are no coverage statistics in the fasta headers (as there would be in Velvet).

What would be the best way to assess the coverage of each contig assembled?

and for bonus points; what would be the easiest way to incorporate the coverage into the fasta headers?

currently looking into mapping reads against assemblies using bwa/samtools etc.

[Working with HiSeq 101bp PE data, 2 lanes. Not using Velvet because I literally don't have the computing resources to run it on all the samples]

Ray metagenomic coverage Assembly QC • 2.6k views
1
Entering edit mode
head project/BiologicalAbundances/_DeNovoAssembly/Contigs.tsv

0
Entering edit mode

If we still had the ability to downvote, I would downvote this comment apelin20.  Yes, you use head to look at fasta file headers -- how does this help the OP?

0
Entering edit mode

This is why we do not have the ability to downvote... because ignorance can lead to many useless downvotes, such as the one you suggest. I am not suggesting a head of a fasta file (I am not afterall showing the dir of a fasta file, tsv is not fasta, right?). I am suggesting to the OP that there is a file that contains all the info he is looking for. The k-mer coverage available in spades and velvet in available in the file mentioned. You simply need a script to replace fasta headers with that info, let me know if you need any help.

2
Entering edit mode

If you actually read the question the OP is asking how to measure read coverage when coverage is NOT provided in the fasta headers or any other file (which is the situation with the Ray assembler).

0
Entering edit mode

1
Entering edit mode

"Mode k-mer coverage depth" isn't coverage though... Where does it define base coverage in the Contigs.tsv?? However, if I wasn't going to map reads back to the assemblies this could prove useful.

0
Entering edit mode

Well, mode and mean are similar things right, it's not like you can have coverage for every position. Anyways: The relation between k-mer coverage Ck and standard (nucleotide-wise) coverage C is Ck = C * (L - k + 1) / L where k is your hash length, and L you read length.

1
Entering edit mode

Let's be clear -- "Mode k-mer coverage depth" is something different from "Mean read coverage"

apelin20, I took offense to your original comment here because you did not put in any effort to help the OP and I feel you did this in a flippant and arrogant way with your "head project/BiologicalAbundances/_DeNovoAssembly/Contigs.tsv, duh?" comment.  You're new here and only been a member a couple of weeks -- just relax.

1
Entering edit mode

I'm glad I wasn't imagining that these are different measures.

I would say that currently I am performing the guide found at the bottom of the comments  (bwa alignment) and it appears to be working rather well. A bit of a faff but once scripted should be able to perform the task I was after.

0
Entering edit mode

It's not as different as you think. There 2 measures are HIGHLY correlated. That's because k-mer coverage is directly convertable to read coverage, so mode k-mer depth can be directly converted to mode read depth. And mode read depth and mean read depth, are both useful measures in assessing coverage.

You should show some professionalism and stop involving your emotions in a bioinformatic forum. Also, I am not NEW here, my account is, I have been on this forum at least a year.

Now let me be clear to YOU. The solution to what the OP was asking is a very simple one. All OP needed was to know where the information was stored. I did not have the time to write a novel about that file, and I didn't need to, it is very self-explanatory. So, I pointed him in the right direction. Albeit, I did not expect to waste so much time on your feelings. Even if I was a new member, your attitude discourages people from contributing, by setting bizzare arbitrary standards where your feelings aren't hurt. There was no "duh ?" in my comment, you are imagining things. You were wrong when you said my solution wasn't correct, just admit it and move on.

0
Entering edit mode

apelin20: This is the last time I comment on this string -- I don't like this back and forth between us -- but I want to make some things clear:

1. I found your initial post insulting to the OP because the question addressed how to calculate read coverage from the Ray assembler and how to incorporate this into the fasta file headers and you responded without any helpful context to "show" the k-mer spectra output of Ray.

2. Mode and mean are two different things -- please be aware of this.

3. I would agree that mode k-mer spectra has some correlation with mean read coverage for bacterial genome analysis (less so with larger duplicated genomes), but this is not correlated in metagenomic samples where strain variation and taxonomic diversity cause the two to be uncorrelated. This does not help the OP who is working with metagenomes and originally asked about read coverage of samples.

I have taken offense to the unhelpfulness of your comments, how you have disregarded the original question as trivial when I think it's certainly not trivial, and your tone here. I see you've had some trouble on other forums and perhaps your "new" account here is a reflection of previous issues.

0
Entering edit mode

I think your speculation is way past reasonable, as a moderator I really expected you to show more maturity and professionalism. I simply use that account when at home, using a different gmail account, because it's an automatic way to log in. Not sure what your research into my personal life has uncovered for you, but judging from your logic so far I wouldn't put too much weight into it. In my time on this forum, I have never found someone to be so disrespectful, or show any disrespect whatsoever, this is a first.

OP, sorry this got out of hand. You asked statistics similar to ones outputed by Velvet. I have showed you a file containing similar info. I did however missed this part of your question as pointed out by Josh Herr

and for bonus points; what would be the easiest way to incorporate the coverage into the fasta headers?

Here is a 1-liner to rewrite fasta headers with info from a table:

awk 'NR==FNR{array[">"$1]=(">"$1 "_coverage=" $6);next} {if($1 in array){$1=array[$1]}}1' TABLE INPUT_FASTA.fasta > OUTPUT_FASTA.fasta


In this example you can use INPUT_contigs from Ray and the Contigs.tsv file to rewrite the headers adding info about k-mer coverage.

This is also useful if you have a table with your headers and mean bowtie coverage.

0
Entering edit mode

Adrian Pelin, The above explanation is much better than the random one liner (without any explanation) which failed to address the OP's question.  Thank you for the clarification.

Whatever you think of my interactions with you here, I stand by what I have said.

I have to admit your multiple accounts is a little inappropriate -- Is it too hard for you to use a single login? I see you upvote one account with the other?

I've been a co-author with Nicholas and work on Glomus, among other things, so next time the three of us are at a meeting together, let's sit down for a friendly drink -- the rounds will be on me.  Cheers!

0
Entering edit mode

;) I know about the AMF paper, that was a fun one.

The upvote is an isolated incident, you can hold me to that. As far as the double account, it originally happend by mistake, in the sense that i was logged into my personal gmail account at home, while at work i am logged into the work gmail, and I use autologin based on gmail.

You just need to understand that sometimes people don't have the luxury of time to write detailed answers. In fact, it is common here that someone will post a quick answer, and if OP is interested, elaboration follows. This is all done in a calm, non-threatening manner (i.e. you don't insult people with down votes, you simply ask for further explanation if you are interested in the solution).

2
Entering edit mode
6.7 years ago
Josh Herr 5.7k

Coverage is inherently tricky in metagenomic sample assembly.  There are many caveats to consider -- one being assembly chimeras, two similar sequences assembling from syntenous regions between two strains or species, which can wreck havoc on things.

I would take your "final" assembly and then map your sequencing reads back onto the assembly using bwa or bowtie and inspect coverage from there.

You should keep in mind that this might not be the best way to gauge the number of a specific taxon (or functional annotation metric) you have in a sample as typically only a small number of reads will assemble in metagenomic samples.  This all depends on the complexity of your sample and the depth of your sequencing.  I usually get anywhere from 5% to 20% of my metagenome reads to assemble, so you might not get a good basis of coverage from mapping short reads.

There are some other metrics of measuring coverage, such as k-mers.  There are lots of tools to use, but one that I would suggest here is khmer's tools for metagenomic reads.

Not sure what you mean exactly about coverage in fasta headers -- SPAdes for example (like Velvet) will estimate coverage and include it in the assembled fasta headers -- I suppose this is what you are asking about?  I would look at the SPAdes manual -- I think the way they address coverage in the fasta headers is informative and unobtrusive.  I don't know of any tools to automate this, so you'd probably have to do it yourself or take Velvet/SPAdes apart and see how they do it.