Question

How To Close Gaps In A 454 Assembly In Silico?

2


15.1 years ago

Michael Barton ★ 1.9k

We've sequenced two ~7-9 Mbp microbial genomes using 454 and assembled the reads with Newbler. For the first bacterium we have 8 scaffolds. These scaffolds contain gaps, which I assumed were regions where the sequencing coverage dropped off. However, when I look at the read depth around these gaps, the contigs appear to terminate prematurely while read depth is still high. I assume these reads continue past the end of the contig but have been ignored. I've been reading the Newbler documentation, and it seems to indicate that contig extension stops when it reaches a repeat in the genome.

Can anyone offer any help on how we can close these scaffold gaps in silico? It seems that we should have the sequence data to get across the gaps, but I don't know how to do it.

sequencing genome assembly • 7.3k views

updated 15.1 years ago by User 59 13k • written 15.1 years ago by Michael Barton ★ 1.9k

score 5 · Answer 1 · 2010-05-21

Michael,

Welcome to the wonderful world of genome finishing. If the repeats are longer than a read (roughly 300-600 bp for FLX Titanium), you will not be able to span them. These gaps may also be caused by the homopolymer issues that this platform suffers from, or by other mysterious artifacts. One option is to use software like Consed or CLC Bio to visualize the areas and work your way into the repeats by finding reads that are anchored in unique sequence. Designing primers that span the gaps and using Sanger sequencing may also be helpful. I assume you don't have a reference of any type to use in piecing things together?
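As a rough illustration of "finding reads that are anchored in unique sequence", here is a minimal Python sketch. It is not how Consed or CLC Bio work internally. The function names and the `anchor_len` parameter are hypothetical, and it assumes exact, error-free matches, which real 454 reads with homopolymer errors will often violate.

```python
# Hypothetical helpers for illustration; exact matching only (an assumption).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def overhangs_past_contig_end(contig, reads, anchor_len=20):
    """Return the read tails that continue past the 3' end of the contig.

    A read counts as anchored if it (or its reverse complement) contains
    the last anchor_len bases of the contig.
    """
    anchor = contig[-anchor_len:].upper()
    overhangs = []
    for read in reads:
        read = read.upper()
        for oriented in (read, revcomp(read)):
            pos = oriented.find(anchor)
            if pos == -1:
                continue
            tail = oriented[pos + len(anchor):]
            if tail:  # skip reads that stop exactly at the contig end
                overhangs.append(tail)
            break  # anchor found in this orientation; skip the other one
    return overhangs


contig = "GGGTTTACGCA"
reads = ["TTACGCATTGG", "TCCTGCGTA", "GGGTTTACGCA", "CCCCCC"]
print(overhangs_past_contig_end(contig, reads, anchor_len=5))
```

If many reads share the same tail, the contig can probably be extended. If the tails fall into two or more distinct groups, the contig end most likely sits at a repeat boundary.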

You can also run a different assembler and then do a MUMmer mapping to see whether any of the gaps were resolved by the other assembler. You would be amazed at how differently assemblers handle the same data.
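Once contigs from a second assembler have been aligned to the scaffolds (for example with MUMmer's nucmer), the comparison boils down to asking which gaps have the same contig aligned on both flanks. The sketch below assumes the alignments have already been reduced to hypothetical `(ref_start, ref_end, contig_name)` tuples in 0-based, half-open scaffold coordinates. Parsing the aligner output is left out, and the `window` parameter is an illustrative choice.

```python
import re


def gap_intervals(scaffold):
    """0-based, half-open (start, end) intervals of N runs in a scaffold."""
    return [m.span() for m in re.finditer(r"[Nn]+", scaffold)]


def bridged_gaps(gaps, alignments, window=100):
    """Map each gap to the contigs that align on both of its flanks.

    alignments: iterable of (ref_start, ref_end, contig_name) tuples
    (a hypothetical format, not any aligner's native output).
    """
    bridged = {}
    for g_start, g_end in gaps:
        # Contigs with an alignment ending just before the gap.
        left = {q for s, e, q in alignments
                if s < g_start and g_start - window <= e <= g_start}
        # Contigs with an alignment starting just after the gap.
        right = {q for s, e, q in alignments
                 if e > g_end and g_end <= s <= g_end + window}
        both = left & right
        if both:
            bridged[(g_start, g_end)] = sorted(both)
    return bridged


scaffold = "ACGT" * 5 + "N" * 10 + "ACGT" * 5 + "NNN" + "ACGT" * 2
alignments = [(0, 20, "ctgA"), (30, 50, "ctgA"),
              (32, 50, "ctgB"), (53, 61, "ctgC")]
print(bridged_gaps(gap_intervals(scaffold), alignments, window=5))
```

A contig that flanks a gap on both sides is only a candidate. Its sequence between the two alignments still needs to be checked against the reads before it is trusted as the gap filler.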

score 2 · Answer 2 · 2010-05-21


15.1 years ago

Wjeck ▴ 490

Generally these gaps are very tricky to span using in silico techniques only, even with 454 reads. You might have to try the wet-bench solution, which is to use Illumina paired-end (PE) reads with a large "insert" size to create a scaffold that jumps those gaps.

There's this project using that technique (shameless self promotion):

http://www.ncbi.nlm.nih.gov.libproxy.lib.unc.edu/pubmed/19015323

But I think others have made considerable improvements since then.



Thanks for the suggestions. We're considering SRS for a second genome we have, which is even more fragmented: more than 50 contigs at 17x coverage. It probably has a large number of repeats ...

written 15.1 years ago by Michael Barton ★ 1.9k

score 2 · Answer 3 · 2010-06-09


15.1 years ago

lexnederbragt ★ 1.3k

In this PDF:

http://www.jgi.doe.gov/News/primer/primer09fall.pdf

on page 2, a program for closing gaps in 454 assemblies is mentioned. We tried it on a bacterial genome, and it seems to work for a subset of the gaps in the scaffolds. We are currently quality checking the closed gaps...



Thanks. That looks useful. How are you quality checking the gaps?

written 15.1 years ago by Michael Barton ★ 1.9k


If you really must know :-) we have early access to the graph viewer, and use it to check which contigs, according to the graph, could (or should) fit in the gap, and align their sequences to the proposed gap-closing sequence. In addition, we did some gap-closing PCRs earlier and check the closed gaps against their sequences. Finally, we are considering checking a bunch of them with new PCRs.

written 15.1 years ago by lexnederbragt ★ 1.3k

score 2 · Answer 4 · 2010-06-09

There's also an approach for generating gap-spanning contigs by aligning sequences to the contig ends and performing local assemblies:

http://genomebiology.com/2010/11/4/R41

"Advances in sequencing technology allow genomes to be sequenced at vastly decreased costs. However, the assembled data frequently are highly fragmented with many gaps. We present a practical approach that uses Illumina sequences to improve draft genome assemblies by aligning sequences against contig ends and performing local assemblies to produce gap-spanning contigs. The continuity of a draft genome can thus be substantially improved, often without the need to generate new data."
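To give a feel for what a local assembly at a contig end involves, here is a toy greedy extension in Python. It is not the pipeline from the paper linked above, only a sketch under strong assumptions: forward-strand reads, exact k-mer matches, and no sequencing errors. The function name and the `k` and `max_steps` parameters are hypothetical.

```python
from collections import Counter


def greedy_fill(left, right, reads, k=20, max_steps=10000):
    """Extend the left contig one base at a time until the right contig starts.

    Returns the gap sequence, or None if the extension runs out of read
    support, hits an ambiguous (tied) branch, or never reaches `right`.
    """
    target = right[:k]
    seq = left
    for _ in range(max_steps):
        tail = seq[-k:]
        votes = Counter()
        # Every read occurrence of the current tail votes for the next base.
        for read in reads:
            pos = read.find(tail)
            while pos != -1:
                nxt = pos + len(tail)
                if nxt < len(read):
                    votes[read[nxt]] += 1
                pos = read.find(tail, pos + 1)
        ranked = votes.most_common(2)
        if not ranked:
            return None  # no read continues past the current end
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            return None  # tie between two bases, likely a repeat boundary
        seq += ranked[0][0]
        if seq.endswith(target) and len(seq) - len(target) >= len(left):
            return seq[len(left):len(seq) - len(target)]
    return None


left = "GATTACAGGC"
right = "TACCTTGAGCA"
reads = ["CAGGCTCAATCG", "CAATCGGTACCTT"]
print(greedy_fill(left, right, reads, k=5))
```

The tie check is where repeats show up: when reads disagree about the next base with equal support, the sketch stops instead of guessing. That is the same conservative behaviour the question describes, where a contig ends even though read depth is still high.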