Has anyone seen a bimodal (two distinct peaks ~200 bp apart) or very wide (s.d. of ~100 bp) insert length distribution in their paired-end data?
We believe there is a fault at our sequencing provider, but the data is already 6 months late, so we have proceeded to try to de novo assemble it.
We have been using velvet, but setting the insertlen and insertlensd options to auto, or providing values for these parameters determined from mapping with BWA, results in a large number of N characters in the scaffolded contigs. If we use a very tight insertlen_sd (10% of the mean) we can eliminate most, if not all, of the N characters. However, we lose a lot of the data that falls outside of the defined range.
Has anyone tried assembling such data with velvet? Does anyone have suggestions of things to try? Could someone suggest an assembly program or algorithm that may handle this kind of data better?
Do you have a close or similar reference genome you can use?
It sounds like very poor fragment library construction. You should be able to get it redone for free, but the fact that it is already so late suggests that probably won't happen.
Try just assembling as single-end reads ("velveth -short") and see what sort of contigs you get. Then align your reads back to those contigs and plot the insert size distribution.
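For the plotting step, here is a rough sketch in Python, assuming the pairs have been aligned back to the contigs with an aligner that writes SAM (BWA, for example). It reads the template length from SAM column 9 (TLEN), counts only the first read of each pair where both mates map to the same contig, and bins the values. The function names, the 25 bp bin width and the 2000 bp cap are my own choices for illustration, not anything from velvet or BWA.

```python
import sys
from collections import Counter


def insert_size_histogram(sam_lines, bin_width=25, max_insert=2000):
    """Bin absolute template lengths (SAM column 9, TLEN) of mapped pairs."""
    histogram = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        # Count each pair once: paired (0x1) and first in pair (0x40).
        if not (flag & 0x1 and flag & 0x40):
            continue
        # Skip unmapped read (0x4), unmapped mate (0x8),
        # secondary (0x100) and supplementary (0x800) alignments.
        if flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        # RNEXT of "=" means the mate is on the same contig.
        if fields[6] != "=":
            continue
        tlen = abs(int(fields[8]))
        if tlen == 0 or tlen > max_insert:
            continue
        histogram[(tlen // bin_width) * bin_width] += 1
    return dict(sorted(histogram.items()))


def print_histogram(histogram, width=50):
    """Print one text bar per bin, scaled to the tallest bin."""
    peak = max(histogram.values(), default=0)
    for start, count in histogram.items():
        bar = "#" * max(1, round(width * count / peak))
        print(f"{start:>5} {count:>7} {bar}")


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        print_histogram(insert_size_histogram(handle))
```

Run it as "python script.py aligned.sam" (both names are placeholders). A bimodal library should show up as two separate humps in the bars. Keep in mind that pairs spanning two contigs are dropped, so short contigs will under-represent the larger inserts.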
Hi Torst - these come from a variety of bacteria, and in some cases we are lucky enough to have published references. Our latest data seems to have suffered a similar fate, which is worrying. I'll see what we get when not considering the paired library information.