Hi folks,
I am assembling paired-end short metagenomic reads (150 bp) with metaSPAdes and MEGAHIT, both at default settings. After binning (with various binners and refiners), only 35-40% of the trimmed, clean reads map back to the bins. Reviewers have criticized this, saying that I did not capture the majority of the community.
I suspect the issue is at the assembly step: contigs > 1 kb account for only ~40% of the total contig length. In other words, many reads are not assembled into contigs long enough to be used for binning. Is there any way to improve assembly performance?
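For anyone wanting to check the same figure on their own assembly, here is a minimal sketch of how the share of assembly length in contigs of at least 1 kb could be computed. The function name `length_fraction_above` and the file name in the usage comment are hypothetical, and the parser assumes a plain FASTA file of contigs.

```python
def length_fraction_above(fasta_lines, min_len=1000):
    """Return the fraction of total contig length held in contigs >= min_len.

    fasta_lines is any iterable of FASTA lines (an open file handle works).
    """
    lengths = []
    current = None  # running length of the contig being read
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous contig, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)

    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(n for n in lengths if n >= min_len) / total


# Usage (hypothetical file name):
# with open("contigs.fasta") as fh:
#     print(length_fraction_above(fh, min_len=1000))
```

A value near 0.4 would correspond to the ~40% described above.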
PS. One plausible explanation is that the samples came from anaerobic digester sludge, which contains a lot of dead cells, fragmented DNA, eDNA and other noise, plus high diversity. Together these could leave many reads that are inherently not accessible to assembly. Does this make sense?
Thanks in advance. Ran
Thanks for your comments. I checked the read mapping ratio against the entire assembly: it is 85% for the SPAdes assembly, 82% for the MEGAHIT assembly, and 80% for the IDBA assembly. The average coverage is ~10x, which seems reasonable because the sludge-derived samples are naturally complex.
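To see how much of the mapped signal is lost between the whole assembly and the bins, one option is to sum per-contig mapped read counts. The sketch below assumes the counts come from `samtools idxstats` (tab-separated: contig name, length, mapped reads, unmapped reads) and that you already have the set of contig names that ended up in bins. The function name `mapped_fraction_in_bins` is hypothetical.

```python
def mapped_fraction_in_bins(idxstats_lines, binned_contigs):
    """Fraction of all mapped reads that land on contigs assigned to a bin.

    idxstats_lines: iterable of lines in samtools idxstats format.
    binned_contigs: set of contig names that belong to any bin.
    """
    mapped_total = 0
    mapped_binned = 0
    for line in idxstats_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and the trailing "*" row (reads with no reference).
        if len(fields) < 4 or fields[0] == "*":
            continue
        name, mapped = fields[0], int(fields[2])
        mapped_total += mapped
        if name in binned_contigs:
            mapped_binned += mapped
    return mapped_binned / mapped_total if mapped_total else 0.0
```

Note that this is a share of the mapped reads, not of all reads, so it would need to be multiplied by the overall mapping rate to compare with the per-bin mapping figure.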
In addition to the nature of the digester sludge sample, I guess the DNA extraction method we used (bead-beating) also contributed to the fragmented community DNA pool.
Ran
I think you are wrongly assuming that your original sample was very fragmented because your assembly is. There are many reasons for a fragmented assembly that have nothing to do with sample prep. For example, with short reads, low coverage doesn't lend itself to a strong assembly. And since your average is 10x, there is a good chance that many of your individual bins are < 5x. That won't assemble well, especially if your community contains a good number of similar subtypes.
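The gap between average and per-organism depth can be illustrated with a back-of-the-envelope calculation. This is only a sketch: the function name `expected_coverage` and the numbers in the comment are hypothetical, and it assumes an organism's depth is simply its share of the sequenced bases divided by its genome size.

```python
def expected_coverage(total_bases, base_fraction, genome_size):
    """Expected mean depth for one genome.

    total_bases:   total sequenced bases in the sample
    base_fraction: fraction of those bases that come from this genome
    genome_size:   genome length in bp
    """
    return total_bases * base_fraction / genome_size


# Hypothetical example: 30 Gbp of reads, a 3 Mbp genome contributing
# 0.05% of the sequenced bases.
# expected_coverage(30e9, 0.0005, 3e6)  # about 5x
```

Under these assumptions, any organism contributing a smaller share of the reads, or carrying a larger genome, ends up below that depth even when the sample-wide average looks adequate.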
You probably got as good of a result as one could expect from a 10x assembly with short reads. You probably didn't catch the whole diversity, specifically those organisms with very low abundance. But that's a problem for any complex community, and it has nothing to do with your assembly or the mapping rate. You have what you have, and it is not realistic for a reviewer to expect that you will be able to squeeze more diversity from your existing sample. I mean who repeats the whole sampling and assembly because they didn't catch maybe the last 5-10% of organisms with extremely low abundance?
My suggestion is to interpret what you have, and to remind the reviewer(s) that with 10x average coverage one can't expect to get them all.