Does more data lead to better genome assembly?
9.2 years ago
rus2dil ▴ 20

I have NGS raw sequences for rice varieties. For a single sample, I have 6 runs. When assembling the complete genome, do I need to consider all 6 runs, or is just one run enough? And if I need all 6 runs, do I need to merge the sequences before mapping or after mapping?

sequencing alignment Assembly
9.2 years ago
Ram 43k

More data means better depth. That is usually good, but it will increase computational complexity dramatically. If you normalize to ease the complexity, your data size will go down anyway.

Beyond a coverage of 80-100X, further increases won't make any tangible difference, IMO. 80-100X is our lab's platinum standard for clinical precision, but we deal with human beings, so take that with a disclaimer.
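As a quick sanity check on how much depth your runs actually add up to, the classic Lander-Waterman estimate C = L·N/G can be computed directly. A minimal sketch; the run sizes and the ~380 Mb rice genome size below are illustrative assumptions, not figures from this thread:

```python
def expected_coverage(num_reads, read_len, genome_size):
    # Lander-Waterman expected coverage: C = L * N / G
    return num_reads * read_len / genome_size

# Hypothetical example: six runs of 50M read pairs (2 x 150 bp)
# against a ~380 Mb rice genome.
reads_per_run = 50_000_000 * 2
cov_one_run = expected_coverage(reads_per_run, 150, 380_000_000)
cov_all_runs = 6 * cov_one_run
```

With these made-up numbers, one run already gives ~39x, and all six together land well past the 80-100X point discussed above.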

9.2 years ago

My lab deals with all kinds of organisms, but still, we don't notice any increase in the utility of data beyond ~150x coverage for assembly. Indeed, we tend to subsample or normalize to bring data down to 100-150x if it is over that.

However, it depends on the experiment and organism. Polyploid organisms (like many plants) can benefit from, say, 100x per haplotype. If you sequence a hexaploid at 100x total, you get only ~16x average per haplotype, which may not be enough to assemble.
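The ploidy arithmetic above is simple but worth making explicit; a small sketch (function names are my own, purely illustrative):

```python
def per_haplotype_coverage(total_coverage, ploidy):
    # Average depth available to each haplotype copy.
    return total_coverage / ploidy

def required_total_coverage(target_per_haplotype, ploidy):
    # Total depth needed to hit a given per-haplotype target.
    return target_per_haplotype * ploidy

# A hexaploid at 100x total leaves ~16.7x per haplotype;
# hitting 100x per haplotype would require 600x total.
```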

For experiments like RNA-seq, with an exponential coverage distribution - or anything where you expect highly biased coverage - the more, the better; there's really no upper limit. But, assembly may require normalization or an iterative approach (high-coverage and low-coverage areas assembled separately).
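To illustrate the normalization idea: digital normalization (as popularized by khmer's diginorm) keeps a read only while its median k-mer abundance, measured over the reads kept so far, is below a cutoff, which flattens high-coverage regions while retaining low-coverage ones. A toy sketch only; real tools use memory-efficient count sketches rather than an exact dictionary:

```python
from collections import Counter

def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def digital_normalize(reads, k=21, cutoff=20):
    """Keep a read only if its median k-mer count (so far) is below cutoff."""
    counts = Counter()
    kept = []
    for seq in reads:
        ks = list(kmers(seq, k))
        if not ks:
            continue
        median = sorted(counts[km] for km in ks)[len(ks) // 2]
        if median < cutoff:
            kept.append(seq)
            counts.update(ks)  # only kept reads contribute to abundance
    return kept
```

Feeding in the same read many times, only the first few survive, while a rare read seen once is always kept.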

9.2 years ago
robert.davey ▴ 280

Ignoring the issue of how you actually measure whether one assembly is "better" than another: in general, depth != coverage != a "good" assembly, so be careful that you're not just adding data, and thus analytical complexity, for the sake of it. Read quality information, contamination screening and filtering are essential. Quality trimming doesn't always lead to increased contiguity, so be careful there too. We've developed a tool called Kontaminant to do screening and filtering in k-mer space. There are others, such as khmer from Titus Brown's lab.

Secondly, you'll need to use some kind of assessment tool to see how your library quality fits with your intended outcome. A group in our institute has developed KAT, the K-mer Analysis Toolkit, to help understand the quality profiles of NGS data based on k-mer frequencies and motifs, outlined in this paper.

Documentation and code can be found here.
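The core object behind k-mer based QC tools like KAT is the k-mer frequency spectrum: a histogram mapping multiplicity to the number of distinct k-mers seen that many times (error k-mers pile up at low multiplicity, genomic k-mers peak near the coverage). This is not KAT itself, just a minimal illustration of the concept:

```python
from collections import Counter

def kmer_spectrum(reads, k=21):
    """Histogram: multiplicity -> number of distinct k-mers seen that often."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return Counter(counts.values())
```

Duplicated reads shift the whole spectrum toward higher multiplicities, which is exactly the signal these tools plot and inspect.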

9.2 years ago

Hi rus2dil,

Since you have already generated the data for your sample, concatenate the reads from all 6 runs before assembly; that will yield a better assembly than using a single run. As for your main question, "Does more data lead to better genome assembly?": more data will generally give you a better assembly, but it still depends on the quality of the data you generated. And since you have more than enough data, filter out low-quality reads and consider keeping only unique reads.
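Merging the runs is cheap because concatenated gzip members form a valid gzip stream, so per-run FASTQ files can be joined by raw byte concatenation (the same as `cat run*.fq.gz > all.fq.gz`). A small sketch; the function name and file layout are illustrative:

```python
import shutil

def concat_fastq_gz(run_paths, out_path):
    # Concatenated gzip members are themselves a valid gzip stream,
    # so merging per-run FASTQ files needs no decompression.
    with open(out_path, "wb") as out:
        for path in run_paths:
            with open(path, "rb") as f:
                shutil.copyfileobj(f, out)
```

Only do this for runs of the same sample and library type; different insert sizes or technologies should instead be given to the assembler as separate libraries.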

Correct me if I am wrong.
