Question

Assembly Of Illumina Paired-End Data With Wide/Bimodal Insert Distributions

1

13.0 years ago

Mitchell ▴ 40

Has anyone seen a bimodal (two distinct peaks about 200 bp apart) or very wide (s.d. of ~100 bp) insert length distribution in their paired-end data?

We believe that there is a fault at our sequencing provider, but the data is already 6 months late, so we have gone ahead and tried to assemble it de novo.

We have been using Velvet, but setting the insertlen and insertlen_sd options (velvetg's -ins_length and -ins_length_sd) to auto, or providing these parameters as determined from mapping with BWA, results in a large number of N characters in the scaffolded contigs. If we use a very tight insertlen_sd (10% of the mean) we can eliminate most, if not all, of the N characters. However, we lose a lot of the data that falls outside of the defined region.

Has anyone tried assembling such data with Velvet? Does anyone have any suggestions of things to try? Could someone suggest an assembly program/algorithm that may handle such data better?

assembly velvet paired • 3.4k views

ADD COMMENT • link updated 13.0 years ago by Botond Sipos ★ 1.7k • written 13.0 years ago by Mitchell ▴ 40

0

Do you have a close or similar reference genome you can use?

ADD REPLY • link 13.0 years ago by Torst ▴ 980

0

It sounds like very poor fragment library construction. You should be able to get it redone for free; however, the fact that it is so late suggests that probably won't happen.

ADD REPLY • link 13.0 years ago by Torst ▴ 980

0

Try just assembling as single end reads "velveth -short" and see what sort of contigs you get. Then align your reads back and plot the insert size distribution.
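A minimal sketch of the "align back and plot" step, assuming the reads have been mapped to the single-end contigs and the alignments are available as SAM text. The function names, bin width, and size cap are illustrative choices, not part of Velvet or BWA. It deliberately does not filter on the "properly paired" flag, since the aligner sets that flag from its own insert size estimate and would clip the very tails you want to see.

```python
from collections import Counter

def insert_size_histogram(sam_lines, bin_width=10, max_size=2000):
    """Bin observed insert sizes (SAM TLEN, column 9) from SAM text lines."""
    hist = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        rnext = fields[6]
        tlen = int(fields[8])
        # Require read and mate both mapped (0x4, 0x8 unset) to the same contig.
        if flag & 0x4 or flag & 0x8 or rnext != "=":
            continue
        # Positive TLEN only, so each pair is counted once.
        if 0 < tlen <= max_size:
            hist[(tlen // bin_width) * bin_width] += 1
    return hist

def print_histogram(hist, width=50):
    """Print a crude text histogram; two peaks should be visible by eye."""
    peak = max(hist.values(), default=0)
    for start in sorted(hist):
        bar = "#" * max(1, hist[start] * width // peak)
        print(f"{start:>5} {hist[start]:>8} {bar}")
```

Typical use would be `print_histogram(insert_size_histogram(open("aln.sam")))`, where `aln.sam` is a placeholder file name. Pairs whose mates land on different contigs are ignored, so a fragmented assembly will under-sample the longer inserts.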

ADD REPLY • link 13.0 years ago by Torst ▴ 980

0

Hi Torst - these come from a variety of bacteria, and in some cases we are lucky enough to have published references. Our latest data seems to have met a similar fate, which is worrying. I'll see what we get when not considering the paired library information.

ADD REPLY • link 13.0 years ago by Mitchell ▴ 40

score 2 · Answer 1 · 2011-12-02

2

12.9 years ago

Botond Sipos ★ 1.7k

Velvet allows for multiple categories of reads (two by default) in order to deal with these situations. Check out the "Using multiple categories" section of the Velvet manual.
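As a rough illustration of what that manual section describes, the sketch below builds (but does not run) a velveth/velvetg command pair in which the two insert-size populations are supplied as separate paired categories. The file names, output directory, k-mer length, and insert sizes are placeholders, and it assumes the reads have already been split into two interleaved FASTQ files and that Velvet was compiled with at least two categories (the default).

```python
def velvet_two_category_commands(out_dir, kmer, lib1, lib2,
                                 ins1, sd1, ins2, sd2):
    """Return (velveth_argv, velvetg_argv) for two paired-end categories."""
    velveth = [
        "velveth", out_dir, str(kmer),
        "-fastq", "-shortPaired", lib1,    # category 1: shorter inserts
        "-fastq", "-shortPaired2", lib2,   # category 2: longer inserts
    ]
    velvetg = [
        "velvetg", out_dir,
        "-ins_length", str(ins1), "-ins_length_sd", str(sd1),
        "-ins_length2", str(ins2), "-ins_length2_sd", str(sd2),
    ]
    return velveth, velvetg

# Placeholder values only; substitute the peaks measured from your own data.
hash_cmd, graph_cmd = velvet_two_category_commands(
    "asm_k31", 31, "short_inserts.fq", "long_inserts.fq", 200, 20, 400, 40)
print(" ".join(hash_cmd))
print(" ".join(graph_cmd))
```

The resulting lists can be passed to `subprocess.run(..., check=True)` once the values look right. Giving each mode its own mean and s.d. avoids describing both peaks with one wide distribution.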

ADD COMMENT • link 12.9 years ago by Botond Sipos ★ 1.7k

score 0 · Answer 2 · 2011-11-04

0

13.0 years ago

Vitis ★ 2.5k

If the distribution is bimodal, can you try dividing the data into two data sets, one per insert size, and then de novo assembling them separately? I think the SAM output from mapping would tell you the inferred insert sizes. You can merge/combine the contigs later. Although we have never tried this, we routinely assemble the same reads with different k-mers and merge the assemblies afterwards using CAP or Phrap. Usually this yields better results than a single assembly.
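One way the split could be done is sketched below, assuming the pairs have been mapped and the SAM TLEN column (column 9) is used as the insert size. The function name and the threshold are hypothetical; the threshold would be chosen by eye as the valley between the two peaks. Pairs that do not map to the same reference sequence have no usable TLEN and end up in neither set.

```python
def split_pairs_by_insert(sam_lines, threshold):
    """Return (short_names, long_names): read-pair names on either side of
    `threshold`, judged by the absolute insert size reported in SAM TLEN."""
    short_names, long_names = set(), set()
    for line in sam_lines:
        if line.startswith("@"):           # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:               # not a complete alignment record
            continue
        flag = int(fields[1])
        # Skip pairs with an unmapped read or mate, or mates on other contigs.
        if flag & 0x4 or flag & 0x8 or fields[6] != "=":
            continue
        tlen = int(fields[8])
        if tlen <= 0:                      # use the positive-TLEN mate only
            continue
        target = short_names if tlen < threshold else long_names
        target.add(fields[0])
    return short_names, long_names
```

The two name sets can then be used to write two FASTQ files from the original reads, which are assembled separately or given to the assembler as two libraries.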