Assembling de novo with multiple versions of paired-end reads
4
3
Entering edit mode
7.6 years ago
kingcohn ▴ 30

I am attempting to assemble a 30 megabase genome using Illumina paired-end data. In order to get decent depth of coverage, the sequencing was done twice, so we have four .fastq files: F & R for version one (roughly 80% coverage) and F & R for version two (~20% coverage). I'm unsure how to assemble from these data. Should I concatenate all forward reads and all reverse reads from the two versions together, then merge the pairs using PEAR etc. and assemble, or should I merge the two sets of F & R reads for each version separately and then assemble?

Assembly de novo paired-end versions Illumina • 4.2k views
ADD COMMENT
1
Entering edit mode
7.6 years ago
igor 13k

Are versions one and two from the same library? If so, you can just concatenate them (all R1 files together and all R2 files together). Otherwise, keep them separate as two libraries.

Do not merge R1 and R2. Any assembler will handle paired-end reads properly.
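
For concreteness, here is a minimal sketch of that kind of concatenation, assuming uncompressed FASTQ and hypothetical file names (run1_R1.fastq etc.); the important part is that R1 and R2 stay in separate files and the runs are written in the same order for both mates:

    # Concatenate two runs of the SAME library, keeping R1 and R2 separate.
    # File names are hypothetical placeholders; adjust to your own data.
    import shutil

    runs = ["run1", "run2"]  # the two sequencing "versions" of the library

    for mate in ["R1", "R2"]:
        with open(f"combined_{mate}.fastq", "wb") as out:
            for run in runs:
                # Simple byte-level concatenation; the run order is identical
                # for R1 and R2, so read pairing stays in sync.
                with open(f"{run}_{mate}.fastq", "rb") as src:
                    shutil.copyfileobj(src, out)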

There was a big project, Assemblathon, that published a thorough review of different assemblers: http://gigascience.biomedcentral.com/articles/10.1186/2047-217X-2-10

They used three different species with 1.0-1.6 Gb genomes, which are considerably larger than yours, but the comparison of assemblers should still be informative.

ADD COMMENT
0
Entering edit mode

I disagree with this advice; merging can substantially improve assembly, though it depends on the specific assembler, merging tool, and insert size distribution.

ADD REPLY
0
Entering edit mode

True. Most of the time you wouldn't merge, though. For example, you may lose the repeats. Optimally, your fragment size should be longer than both of your reads combined. In that case, merging would not work at all.

Sometimes having fewer reads will also improve the assembly, but you wouldn't advise people to generate fewer reads.

ADD REPLY
1
Entering edit mode

Actually... when fewer reads improve the assembly, I would advise people to reduce the number of reads, either by normalization or subsampling. At that point it's too late to generate fewer reads. And the optimal fragment size is not necessarily longer than both reads, specifically because merging can be beneficial. Because the error rate increases toward the end of the read, which is the part most likely to overlap, merging can substantially reduce the overall error rate.
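
As a minimal sketch of what I mean by reducing reads after the fact (BBTools option names written from memory; double-check against the tool help, and the file names are hypothetical):

    # Two ways to reduce read count before assembly: random subsampling or
    # depth normalization. Option names are from memory; verify with
    # reformat.sh / bbnorm.sh --help on your installed BBTools version.
    import subprocess

    # Option 1: keep a random ~50% of the pairs.
    subprocess.run([
        "reformat.sh",
        "in=reads_R1.fastq", "in2=reads_R2.fastq",   # hypothetical inputs
        "out=sub_R1.fastq", "out2=sub_R2.fastq",
        "samplerate=0.5",
    ], check=True)

    # Option 2: normalize to a target depth of ~100x.
    subprocess.run([
        "bbnorm.sh",
        "in=reads_R1.fastq", "in2=reads_R2.fastq",
        "out=norm_R1.fastq", "out2=norm_R2.fastq",
        "target=100",
    ], check=True)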

I'm currently doing an analysis of optimal preprocessing of metagenomes prior to assembly. So far, on multiple datasets, merging is universally beneficial for Spades (an assembler that makes very good use of paired reads). It's also universally beneficial for Tadpole (which does not make use of paired reads). It appears to be neutral or detrimental to Megahit. However, even then, running BBMerge with the "ecco" flag (which error corrects read pairs in the overlap region via consensus, but outputs the reads as a pair rather than merging them) is universally beneficial to Megahit in my tests so far. Purely overlap-based error-correction is only possible with overlapping paired reads.
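
A hedged sketch of both modes, with hypothetical file names and BBMerge option names written from memory (check bbmerge.sh --help before relying on them):

    # (a) Standard merging: overlapping pairs become single longer reads;
    #     pairs that do not overlap are written to the "outu" files.
    # (b) "ecco" mode: error-correct the overlap by consensus but keep the
    #     reads paired (the variant that helped Megahit in my tests).
    # Option names are from memory; verify with bbmerge.sh --help.
    import subprocess

    subprocess.run([
        "bbmerge.sh",
        "in1=reads_R1.fastq", "in2=reads_R2.fastq",   # hypothetical inputs
        "out=merged.fastq",
        "outu1=unmerged_R1.fastq", "outu2=unmerged_R2.fastq",
    ], check=True)

    subprocess.run([
        "bbmerge.sh",
        "in1=reads_R1.fastq", "in2=reads_R2.fastq",
        "out1=ecco_R1.fastq", "out2=ecco_R2.fastq",
        "ecco=t", "mix=t",   # correct overlaps but output all reads, still paired
    ], check=True)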

From prior data, it seems that merging is detrimental to Soap but beneficial for Ray. I have not tested the "ecco" mode with Soap, though.

So, yes, I would recommend that people design their libraries to overlap for assembly projects to take advantage of this; that is why JGI designs its libraries to overlap. I'm not really sure what you mean by "For example, you may lose the repeats." I have not seen this occur, nor can I think of a reason why it would be caused by merging. And lastly, it actually is possible to merge non-overlapping reads; BBMerge can do this, for example, with the "rem" flag. This greatly improves Tadpole assemblies because it allows the use of much longer kmers, and Tadpole relies on long kmers for a good assembly because it does not perform any graph simplification. I've not tested it thoroughly with other assemblers, but typically I find the effects of preprocessing to be highly correlated between Tadpole and Spades.
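
For the non-overlapping case, a hedged sketch of the kmer-extension merging mentioned above (the rem/extend option names are written from memory and this mode needs enough RAM for a kmer table; file names are hypothetical):

    # Merge pairs whose fragments are longer than the two reads combined by
    # first extending the read ends with kmers, then merging; "rem" refuses
    # the merge when the insert size predicted before and after extension
    # disagrees. Option names from memory; verify with bbmerge.sh --help.
    import subprocess

    subprocess.run([
        "bbmerge.sh",
        "in1=reads_R1.fastq", "in2=reads_R2.fastq",   # hypothetical inputs
        "out=merged_long.fastq",
        "outu1=unmerged_R1.fastq", "outu2=unmerged_R2.fastq",
        "rem=t",        # require agreement between extension and overlap
        "extend2=50",   # extend read ends by up to 50 bp using kmers
        "k=62",         # kmer length for the extension step
    ], check=True)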

ADD REPLY
0
Entering edit mode

Thanks for clarifying!

I don't mean to imply you are wrong, but if the merging is beneficial, why do the assemblers not explicitly recommend it in their documentation or just implement it themselves? For example, SPAdes has a few pre-processing steps. Why not add read merging to that workflow?

ADD REPLY
0
Entering edit mode

Allpaths-LG does merging internally (and explicitly requires libraries with a substantial fraction of overlapping reads), and I think there is at least one other assembler, whose name I forget, that does it as well. And merging is almost universally performed (or should be) prior to overlap-based assembly. I'm not sure why it is not explicitly recommended by assemblers, but perhaps the reason is that assembler developers tested with a merging tool that had a high false-positive rate and concluded that merging led to inferior assemblies due to the creation of chimeras. Unlike other preprocessing steps, merging can introduce new errors, which is why BBMerge is very conservative.

Also, some tools like Spades are designed around certain assumptions, like a bell-shaped distribution of paired insert sizes. Merging will destroy that assumption, but even so, it still improves the assembly! Similarly, the Spades team does not recommend or internally perform normalization, partly because it messes with their path-simplification heuristics. Even so, in some cases normalization improves single-cell assembly (more often it's neutral or marginally worse) and in all cases the normalized assembly uses vastly lower resources (time and memory), which often means the difference between an assembly and no assembly. I don't recommend normalization as a universal preprocessing step; the point is that it is often useful for Spades, but explicitly not recommended. Why? Well, extensively testing all of your assumptions (particularly those you designed an algorithm around!) is very time-consuming; I'm sure the team is busy improving other aspects of Spades.

ADD REPLY
0
Entering edit mode

From SPAdes changelog for SPAdes 3.12.0 (May 2018):

NEW: Support for merged paired-end reads.

And now in the manual:

If you have merged some of the reads from your paired-end (not mate-pair or high-quality mate-pair) library (using tools s.a. BBMerge or STORM), you should provide the file with resulting reads as a "merged read file" for the corresponding library.
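
So a workflow along those lines might look like this (a sketch only; file names are hypothetical, and it assumes SPAdes 3.12 or later):

    # Feed BBMerge output to SPAdes using the --merged option added in 3.12.
    # File names are hypothetical placeholders.
    import subprocess

    subprocess.run([
        "spades.py",
        "-1", "unmerged_R1.fastq",     # pairs that did not merge
        "-2", "unmerged_R2.fastq",
        "--merged", "merged.fastq",    # reads merged by BBMerge (or similar)
        "-o", "spades_out",
    ], check=True)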

ADD REPLY
0
Entering edit mode
7.6 years ago
Rohit ★ 1.5k

A 30 Mb genome, so I guess it might be a bacterial genome. I think you mean 80X and 20X coverage, since % is a completely different measure. The SPAdes assembler might be ideal in your case. Merging paired-end reads into single longer reads usually increases assembly contiguity, so go for it. In the end you would use two paired-end libraries and two single-end libraries (the merged reads); see the sketch below the manual link. Check out the SPAdes manual:

http://spades.bioinf.spbau.ru/release3.9.0/manual.html#sec3.4
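
A hedged sketch of that library layout using SPAdes' multi-library options as documented in the manual linked above (file names are hypothetical):

    # Two paired-end libraries, each with its merged reads supplied as a
    # single-end file belonging to the same library. File names are
    # hypothetical placeholders.
    import subprocess

    subprocess.run([
        "spades.py",
        # library 1: unmerged pairs plus its merged reads
        "--pe1-1", "lib1_unmerged_R1.fastq",
        "--pe1-2", "lib1_unmerged_R2.fastq",
        "--pe1-s", "lib1_merged.fastq",
        # library 2: same layout
        "--pe2-1", "lib2_unmerged_R1.fastq",
        "--pe2-2", "lib2_unmerged_R2.fastq",
        "--pe2-s", "lib2_merged.fastq",
        "-o", "spades_out",
    ], check=True)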

ADD COMMENT
0
Entering edit mode

Probably not bacterial. The largest known prokaryotic genome is about 13 Mb, at least as of 2013 ( http://www.nature.com/articles/srep02101 ).

ADD REPLY
0
Entering edit mode

Then fungal it is... :)

ADD REPLY
0
Entering edit mode

True. It's hard to guess with that size.

ADD REPLY
0
Entering edit mode
7.6 years ago
kingcohn ▴ 30

Insect. It's an insect genome, well, several related species, each with a genome of around 1.1 gigabases, not 30 Mb (which is roughly one chromosome). Anyway, I'm attempting to first identify large genomic variations between species, then call variants among them. But total coverage is pretty low at only around 20x, so the first set of reads is around 16x coverage, etc. Thanks for the suggestions, all! I have a feeling I'll be inquiring pretty regularly.

ADD COMMENT
0
Entering edit mode

20X is low for assembly. It's even on the low end for variant calling, which is a more lenient process.

ADD REPLY
0
Entering edit mode

I agree... 20X for de novo assembly would give you fragmented contigs; 60X is recommended, even though 40X would be enough. Also, why go for assembly if genomic variation compared to a reference is your aim?

For variant calling, 15X is the bare minimum, depending on how much read support you have for your regions of interest. 30X should be good for SVs, though. If you plan on more experiments, long-read data or mate-pairs would be a better option for large genomic variation, in addition to paired-end reads.

ADD REPLY
0
Entering edit mode
7.6 years ago
kingcohn ▴ 30

Yeah, we're aware of the limitations; the data was originally intended to be aligned to the reference genome of a related species via BWA, which I've done. But I am curious to see whether we can assemble contigs of comparable quality without a reference, which might reveal unique features in our system. Greater coverage would be ideal. Also, could you describe what you mean by fragmented contigs? Much appreciated.

ADD COMMENT
1
Entering edit mode

Hi kingcohn, please use ADD REPLY/ADD COMMENT to respond to an earlier post, so that threads remain logically structured and easy to follow. Thanks!

ADD REPLY
0
Entering edit mode

When you are doing an assembly, you ideally end up with a number of contigs equal to the number of chromosomes. That would be a closed genome. However, that is essentially impossible with short reads (although you can add mate-pair libraries to greatly improve the results). Even small genomes (<1 Mb) are hard to close. If you can't even close 1 Mb of sequence, imagine how many contigs you will get with 1 Gb.

ADD REPLY
