Question

many reads are not assembled into long contigs

3.5 years ago

mrzhsy • 0

Hi folks,

I am assembling paired-end short metagenomic reads (150 bp) using metaSPAdes and MEGAHIT (with default settings). After binning (with various binners and refiners), only 35-40% of the trimmed, clean reads could be mapped to the bins. Reviewers criticized this, saying that I did not capture the majority of the community.

I suspect the issue is at the assembly step, where contigs > 1kb account for only ~40% of the total contig length. In other words, many reads are not assembled into contigs long enough to be used for binning. Is there any way to improve assembly performance?

PS. One plausible explanation is that the samples came from anaerobic digester sludge, which contains large amounts of dead cells, fragmented DNA, eDNA and other noise, plus high diversity, which together leave many reads naturally inaccessible to assembly. Does this make sense?

Thanks in advance. Ran

Metagenomics assemble • 2.1k views

ADD COMMENT • link updated 3.5 years ago by Mensur Dlakic ★ 29k • written 3.5 years ago by mrzhsy • 0

score 0 · Answer 1 · 2021-12-20

3.5 years ago

Mensur Dlakic ★ 29k

There is nothing wrong with ~40% of your sequence being in contigs > 1kb. I randomly checked two of my metagenomes, and they had 37% and 44% of total sequence in contigs > 1kb. I have seen instances of less than 20% of contigs fulfilling this criterion.
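For anyone who wants to check the same number on their own assembly, here is a minimal sketch of how that fraction could be computed. It assumes a plain FASTA file of contigs and counts a contig as "long" when it is at least `min_len` bases; the function names and the toy sequences are made up for illustration.

```python
from typing import Iterable, Iterator


def contig_lengths(fasta_lines: Iterable[str]) -> Iterator[int]:
    """Yield the sequence length of each record in FASTA-formatted lines."""
    length = None  # stays None until the first header line is seen
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def fraction_in_long_contigs(lengths: Iterable[int], min_len: int = 1000) -> float:
    """Fraction of total assembled bases found in contigs of at least min_len."""
    lengths = list(lengths)
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(n for n in lengths if n >= min_len) / total


# Toy example: one 1200 bp contig and one 400 bp contig (hypothetical data).
toy_fasta = [">contig_1", "ACGT" * 300, ">contig_2", "ACGT" * 100]
toy_fraction = fraction_in_long_contigs(contig_lengths(toy_fasta))
```

On a real assembly you would pass an open file handle instead of the toy list, since a file object is also an iterable of lines.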

I think you should report how many reads mapped to the total assembly rather than just the > 1kb contigs. After all, reads contribute to the whole assembly, not just the contigs you selected. If you got less than 65-70% mapping rate to the whole assembly, then I'd be worried. But it is to be expected that a mapping rate will be smaller than that for contigs that represent only 35-40% of your assembly, and I don't see anything worrisome there.
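If it helps, the whole-assembly mapping rate can be tallied from an uncompressed SAM file along these lines. This is only a sketch: it assumes reads were aligned to the full set of contigs, counts each mate as one read, and skips secondary and supplementary alignments so that no read is counted twice. The flag bits come from the SAM format; the function name and the toy records are made up.

```python
from typing import Iterable

# Bit flags defined by the SAM format.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_rate(sam_lines: Iterable[str]) -> float:
    """Fraction of reads with a mapped primary alignment."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    return mapped / total if total else 0.0


# Toy records (hypothetical): one mapped pair, one unmapped pair,
# and one secondary alignment that must be ignored.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t99\tc1\t1\t60\t4M\t=\t10\t13\tACGT\tIIII",
    "r1\t147\tc1\t10\t60\t4M\t=\t1\t-13\tACGT\tIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t355\tc2\t5\t0\t4M\t=\t20\t19\tACGT\tIIII",
]
toy_rate = mapping_rate(toy_sam)
```

Running the same count once against the full assembly and once against only the binned contigs makes the two numbers directly comparable.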

ADD COMMENT • link 3.5 years ago by Mensur Dlakic ★ 29k

Thanks for your comments. I checked the read mapping rate on the entire assembly. It is 85% for the SPAdes assembly, 82% for the MEGAHIT assembly, and 80% for the IDBA assembly. The average coverage is ~10x, which is reasonable because the sludge-derived samples were naturally complex.

In addition to the nature of the digester sludge sample, I guess our DNA extraction method (bead-beating) also contributed to the fragmented community DNA pool.

Ran

ADD REPLY • link 3.5 years ago by mrzhsy • 0

I think you are wrongly assuming that your original sample was very fragmented because your assembly is. There are many reasons for a fragmented assembly that have nothing to do with sample prep. For example, with short reads, low coverage doesn't lend itself to a very strong assembly. And since your average is 10x, there is a good chance that many of your individual bins are < 5x. That won't assemble well, especially if there is a good number of similar subtypes in your community.
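To put rough numbers on that, mean coverage is just sequenced bases divided by genome (or contig) length. A back-of-the-envelope sketch, assuming 150 bp reads as in the question and a hypothetical 3 Mb genome chosen only for illustration:

```python
import math


def mean_coverage(n_reads: int, read_len: int, seq_len: int) -> float:
    """Average depth: total sequenced bases divided by sequence length."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    return n_reads * read_len / seq_len


def reads_needed(target_cov: float, read_len: int, seq_len: int) -> int:
    """Smallest number of reads that reaches target_cov on average."""
    if read_len <= 0:
        raise ValueError("read_len must be positive")
    return math.ceil(target_cov * seq_len / read_len)


# Hypothetical organism with a 3 Mb genome, sequenced with 150 bp reads.
cov_from_100k_reads = mean_coverage(100_000, 150, 3_000_000)
reads_for_10x = reads_needed(10, 150, 3_000_000)
```

So an organism that attracts only about 100,000 reads out of the whole library sits at roughly 5x, which is the range where short-read assembly tends to break into small contigs.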

You probably got as good a result as one could expect from a 10x assembly with short reads. You probably didn't catch the whole diversity, specifically those organisms with very low abundance. But that's a problem for any complex community, and it has nothing to do with your assembly or the mapping rate. You have what you have, and it is not realistic for a reviewer to expect that you will be able to squeeze more diversity from your existing sample. I mean, who repeats the whole sampling and assembly because they didn't catch maybe the last 5-10% of organisms with extremely low abundance?

My suggestion is to interpret what you have, and to remind the reviewer(s) that with 10x average coverage one can't expect to get them all.

ADD REPLY • link 3.5 years ago by Mensur Dlakic ★ 29k

score 0 · Answer 2 · 2021-12-20

As to improving the assembly, it depends on the number of MAGs/bins you have and on sequencing depth. If you have very high sequencing depth (say, > 500x on average), non-random sequencing error will introduce "mutations" that can't be corrected by a regular assembly process, and will lead to contig fragmentation. This can also lower the mapping rate. If this is the case with your data, thinning down your reads may result in a better assembly. By better I mean primarily fewer contigs and longer sequences, but your binning fraction may not move much from 40%.
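One simple way to thin down reads is random subsampling of read pairs. The sketch below is illustrative only: it assumes uncompressed, strictly 4-line FASTQ records with both files listing mates in the same order, and the function names, the `fraction` and `seed` parameters, and the toy records are my own choices.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, ...]


def fastq_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into 4-line FASTQ records (assumes unwrapped sequences)."""
    it = iter(lines)
    while True:
        rec = tuple(islice(it, 4))
        if not rec:
            return
        if len(rec) != 4:
            raise ValueError("truncated FASTQ record")
        yield rec


def subsample_pairs(
    r1_lines: Iterable[str],
    r2_lines: Iterable[str],
    fraction: float,
    seed: int = 0,
) -> Iterator[Tuple[Record, Record]]:
    """Keep each read pair with probability `fraction`; mates stay together."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    # Note: zip stops at the shorter input, so unequal files are not detected.
    for rec1, rec2 in zip(fastq_records(r1_lines), fastq_records(r2_lines)):
        if rng.random() < fraction:
            yield rec1, rec2


# Toy paired input (hypothetical reads).
toy_r1 = ["@a/1\n", "ACGT\n", "+\n", "IIII\n", "@b/1\n", "TTTT\n", "+\n", "IIII\n"]
toy_r2 = ["@a/2\n", "GGGG\n", "+\n", "IIII\n", "@b/2\n", "CCCC\n", "+\n", "IIII\n"]
kept_all = list(subsample_pairs(toy_r1, toy_r2, 1.0))
kept_none = list(subsample_pairs(toy_r1, toy_r2, 0.0))
```

Drawing one random number per pair, rather than per file, is what keeps the two output files in sync, which paired-end assemblers require.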