Genome Assembly Using 2 Insert Size Libraries
12.8 years ago
Leszek 4.2k

Hi,

I want to assemble a ~20Mb fungal genome. We sequenced two insert-size libraries (Illumina, paired-end, 45bp reads): 300bp and 600bp.

I've been using Velvet so far (1.1.04). There is not much improvement between the assembly created using only one library and the assembly using both libraries:

library    n50      coverage    contigs
300bp      4,314    27.762      4916
both       9,096    79.890      3122

I've tried ABySS as well, but the number of contigs >1kb is over 8000. Could you recommend other assemblers that perform better with multiple insert-size libraries?

UPDATE:
I've tried SOAPdenovo; it seems to outperform Velvet with two insert-size libraries. Interestingly, Velvet outperforms SOAPdenovo on other fungi (very closely related, though), but for those we have only one library. All data were generated with k=39 unless stated otherwise; k=39 performed best according to VelvetOptimiser.

#assembly         contigs       bases    GC [%]           Ns (%)         N50         N90
SOAP/pe_300only      4060    18918810    61.162     522787  (2%)    12422.45     6149.31
SOAP/pe_600only      4101    18956824    60.876    5499903 (29%)    12141.85     6058.75
velvet/pe_300only    4916    18755580    61.156     492899  (2%)     8488.68     4787.55
velvet/pe_600only    2379    17849192    60.189    2492483 (13%)    32829.79    12311.47
SOAP/pe_both_k33     2405    20326354    61.483    2835716 (13%)    26706.45    12975.35
SOAP/pe_both_k37     2280    20412482    61.328    1987084  (9%)    28706.97    13993.94
SOAP/pe_both_k39     2271    20809661    61.318    1408958  (6%)    29732.93    14409.50
SOAP/pe_both_k41     2275    20493113    61.295    1777292  (8%)    29447.59    14232.10
velvet/pe_both       3122    19461165    60.702     944011  (4%)    19010.44     9070.62
velvet/single       57772    20825488    60.981       5670  (0%)     1129.67      480.99

What is striking is that Velvet gives a small number of contigs (and high N50 and N90) with the 600bp library alone! Unfortunately, that assembly misses quite a lot of data (18Mb as compared to 20Mb in the others).
On the other hand, SOAPdenovo with both libraries gives the best assembly: 2271 contigs, N50 ~30kb and only 6% Ns.
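For anyone wanting to reproduce this kind of comparison, here is a minimal Python sketch of the summary columns (contigs, bases, GC, Ns, N50, N90). It uses the common definition of N50/N90, which gives integer lengths, so it may not match the script used for the table above (the fractional values there suggest a different calculation). The function names `nx` and `assembly_stats` are made up for illustration, and computing GC over non-N bases only is an assumption.

```python
def nx(lengths, fraction):
    """Return the length L such that contigs of length >= L together
    cover at least `fraction` of the total assembly length
    (fraction=0.5 gives N50, fraction=0.9 gives N90)."""
    if not lengths:
        return 0
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return ordered[-1]


def assembly_stats(contigs):
    """Summarise a list of contig/scaffold sequences given as strings."""
    seqs = [s.upper() for s in contigs]
    lengths = [len(s) for s in seqs]
    total = sum(lengths)
    n_bases = sum(s.count("N") for s in seqs)
    gc_bases = sum(s.count("G") + s.count("C") for s in seqs)
    called = total - n_bases  # assumption: GC is computed over non-N bases
    return {
        "contigs": len(seqs),
        "bases": total,
        "gc_percent": 100.0 * gc_bases / called if called else 0.0,
        "n_bases": n_bases,
        "n_percent": 100.0 * n_bases / total if total else 0.0,
        "n50": nx(lengths, 0.5),
        "n90": nx(lengths, 0.9),
    }
```

Feeding it the sequences from each assembler's output FASTA (parsed however you prefer) gives one row per assembly, which makes it easier to compare runs with exactly the same definitions.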

Do you have any comments on that?


SOAP/pe_both_k39 looks best in your results. It uses both libraries, and it is the best among the k-mer sizes you tried with SOAPdenovo. Try different k-mer sizes with Velvet as well, using both libraries.


I've tried k in the range 31-61 in steps of 2; k=39 seems to be the best for Velvet as well (according to VelvetOptimiser, for both the best N50 and Lbp).

12.8 years ago
Benm ▴ 710

In your statistics, the assembly using both libraries has an N50 about twice that of the single-library assembly, and the number of contigs decreased, so it looks like good performance. In genome assembly, each library has a coverage saturation point; if you plot N50, total length and contig number against accumulated coverage, the curves will show you where it is.
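One way to read such a curve programmatically: assemble random subsamples of the reads at increasing coverage, record a metric such as N50 for each, and look for the first point where the next step adds almost nothing. The Python sketch below assumes you already have those (coverage, metric) pairs; the function name `saturation_point` and the 5% default threshold are arbitrary choices for illustration, not an established rule.

```python
def saturation_point(points, min_gain=0.05):
    """Given (coverage, metric) pairs from assemblies of subsampled reads,
    return the first coverage after which the metric improves by less than
    `min_gain` (as a fraction of its current value) at the next step.
    Returns None if the metric is still improving at the last step."""
    ordered = sorted(points)
    for (cov_a, val_a), (_cov_b, val_b) in zip(ordered, ordered[1:]):
        if val_a > 0 and (val_b - val_a) / val_a < min_gain:
            return cov_a
    return None
```

If the function returns None, the library has probably not saturated yet at the coverages you tested, and more data from it may still help.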

BTW, since you asked about other assemblers for multiple insert-size libraries, I recommend you try these:

  • SOAPdenovo (current version 1.05, 14-02-2011; 2.0 has been announced and will be released soon) - SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.

  • ALLPATHS (v2.2, 2-5-2010) - ALLPATHS is the predecessor of ALLPATHS-LG. It works on ~30 base reads.

  • Ray (1.6.0, 6-13-2011) - Ray is parallel software that computes de novo genome assemblies with next-generation sequencing data.

I personally prefer SOAPdenovo for Illumina short reads with PE/MP libraries of different insert sizes.


Have to agree with this (although in our case SOAPdenovo isn't doing a very good job on our assemblies but it is due to other issues). Depending on coverage and number of reads the 300bp insert library may have generated enough data for a reasonable assembly by itself, and so adding the 600bp insert library won't necessarily net you huge gains. But the gains you have gotten seem pretty good.
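To put a number on "enough data", a quick back-of-the-envelope coverage check can help. The Python sketch below uses the usual base coverage estimate (reads x read length / genome size) and the k-mer coverage relation given in the Velvet manual, Ck = C * (L - k + 1) / L. The function names are made up for illustration, and the read count in the usage note is hypothetical, not taken from this thread.

```python
def base_coverage(n_reads, read_len, genome_size):
    """Expected per-base coverage: total sequenced bases / genome size."""
    return n_reads * read_len / genome_size


def kmer_coverage(base_cov, read_len, k):
    """K-mer coverage as described in the Velvet manual:
    Ck = C * (L - k + 1) / L, for reads of length L and k-mer size k."""
    if k > read_len:
        raise ValueError("k cannot exceed the read length")
    return base_cov * (read_len - k + 1) / read_len
```

For example, a hypothetical 20 million reads of 45bp on a 20Mb genome give 45x base coverage, but with k=39 only 7x k-mer coverage, because each 45bp read contains just 7 k-mers of that size. With short reads and a large k, a library can look deep on paper and still be thin from the assembler's point of view.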


+1 for SOAPdenovo:)

12.8 years ago

I am impressed with the improvement you are seeing, because it's not like the 600bp library is going to span a different class of repeats than 300bp library. What is the result of running the 600 alone?

I would continue to explore the Velvet parameter space, as well as turn off pairing to see how much that is being leveraged.


Running the 600bp library alone gives quite a good assembly! Of course it's quite gappy and misses ~2Mb (10%) of the genome (as compared to the biggest assemblies).


I am not familiar with 2x45bp. Did you hard-trim 76bp reads? If so, try quality trimming instead, using the UC Davis pair-safe trimmer.


No, I trimmed at quality <20 and discarded reads <31bp. The 600bp library is 2x46bp (HiSeq); the 300bp library is 76bp (GAII).
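For reference, a pair-aware version of that kind of trimming could look roughly like the Python sketch below. It is only an illustration, not the tool actually used here: cutting each read at the first base below Q20 is one of several possible trimming rules, the function names are made up, and the Phred offset is left as a parameter because it depends on the pipeline version (33 or 64).

```python
def trim_read(seq, qual, min_q=20, offset=33):
    """Cut a read at the first base whose Phred quality is below min_q.
    `qual` is the FASTQ quality string; `offset` is the ASCII offset."""
    for i, ch in enumerate(qual):
        if ord(ch) - offset < min_q:
            return seq[:i], qual[:i]
    return seq, qual


def trim_pair(read1, read2, min_q=20, min_len=31, offset=33):
    """Trim both mates, each given as a (seq, qual) tuple.
    Returns None when either mate ends up shorter than min_len, so both
    mates are dropped together and the paired files stay in sync."""
    t1 = trim_read(*read1, min_q=min_q, offset=offset)
    t2 = trim_read(*read2, min_q=min_q, offset=offset)
    if len(t1[0]) < min_len or len(t2[0]) < min_len:
        return None
    return t1, t2
```

Dropping both mates together is what keeps the pairing intact; whether to rescue the surviving mate as a single read instead is a separate design choice.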
