Question

Pipeline for hybrid genome assembly of long and short reads AND how to deal with PacBio Sequel data

2

6.4 years ago

freddiejung ▴ 60

Dear all,

I am involved in a genome project for an insect species. The genome size of our species is estimated to be 500 Mb, and I have 100x Illumina short-read data and 50x PacBio Sequel long-read data.
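As a quick sanity check on these numbers, coverage multiplied by genome size gives the approximate sequencing yield. A minimal sketch (the function name is only illustrative):

```python
def expected_yield_bases(genome_size_bp, coverage):
    """Approximate total sequenced bases for a given depth of coverage."""
    return genome_size_bp * coverage


GENOME_SIZE_BP = 500_000_000  # estimated genome size of 500 Mb

illumina_bases = expected_yield_bases(GENOME_SIZE_BP, 100)  # 100x short reads
pacbio_bases = expected_yield_bases(GENOME_SIZE_BP, 50)     # 50x long reads

print(f"Illumina: ~{illumina_bases / 1e9:.0f} Gb, PacBio: ~{pacbio_bases / 1e9:.0f} Gb")
```

So the data set corresponds to roughly 50 Gb of Illumina bases and 25 Gb of PacBio bases.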

I have two questions:

1. I plan to use a hybrid assembly strategy, and I thought the ALPACA pipeline (https://github.com/VicugnaPacos/ALPACA) would suit our situation. However, I realized that ALPACA uses ALLPATHS-LG internally, and we don't have the fragment library that ALLPATHS-LG requires. Is there a better alternative pipeline?

2. I am totally new to PacBio data. Can I use the subreads.bam files directly for assembly, or do I have to run quality control steps first?

Any comments would be appreciated.

Genome Hybrid Assembly PacBio illumina Assembly • 5.0k views

ADD COMMENT • link updated 6.4 years ago by Carambakaracho ★ 3.3k • written 6.4 years ago by freddiejung ▴ 60

2

With 50x PacBio Sequel data, you could try miniasm (https://github.com/lh3/miniasm) for the assembly, then racon (https://github.com/isovic/racon) for consensus, and then perhaps pilon (https://github.com/broadinstitute/pilon) for error correction using the Illumina short reads mapped to the racon assembly.
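The three steps could be chained roughly as sketched below. This is only an outline, assuming the subreads have already been converted from BAM to FASTQ or FASTA and that minimap2, bwa and samtools are installed. Every file name, the pilon.jar path and the thread count are placeholders. miniasm writes GFA rather than FASTA, so a small helper extracts the contig sequences before polishing.

```python
def gfa_to_fasta(gfa_text):
    """Convert the segment (S) lines of a GFA file into FASTA records."""
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # S lines look like: S <name> <sequence> [optional tags]
        if len(fields) >= 3 and fields[0] == "S":
            records.append(f">{fields[1]}\n{fields[2]}")
    return "\n".join(records) + ("\n" if records else "")


def pipeline_commands(reads, r1, r2, threads=8):
    """Return shell commands for a miniasm -> racon -> pilon run (nothing is executed)."""
    return [
        # 1. all-vs-all overlaps of the long reads, then layout with miniasm
        f"minimap2 -x ava-pb -t {threads} {reads} {reads} > overlaps.paf",
        f"miniasm -f {reads} overlaps.paf > draft.gfa",
        # (convert draft.gfa to draft.fasta with gfa_to_fasta before continuing)
        # 2. map the long reads back to the draft and build a consensus with racon
        f"minimap2 -x map-pb -t {threads} draft.fasta {reads} > mapped.paf",
        f"racon -t {threads} {reads} mapped.paf draft.fasta > racon.fasta",
        # 3. map the Illumina pairs to the racon assembly and polish with pilon
        "bwa index racon.fasta",
        f"bwa mem -t {threads} racon.fasta {r1} {r2} | samtools sort -o illumina.bam -",
        "samtools index illumina.bam",
        "java -jar pilon.jar --genome racon.fasta --frags illumina.bam --output pilon",
    ]


for command in pipeline_commands("subreads.fastq", "illumina_R1.fastq", "illumina_R2.fastq"):
    print(command)
```

In practice the racon step is often repeated for more than one round, and pilon will need a generous Java heap for a genome of this size.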

As far as I know, you should be able to use the subreads directly. Just note that all Sequel subread bases are given a Phred quality score of 0, so racon should be run with the flag -q -1 in this scenario.
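To check whether a FASTQ file converted from subreads.bam really carries placeholder qualities, the quality strings can be decoded directly. A small sketch, assuming the standard Phred+33 encoding (where "!" means Q0); the function names are made up for illustration:

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def has_placeholder_qualities(quality_string):
    """True if every base in a non-empty quality string has Phred score 0."""
    scores = phred_scores(quality_string)
    return bool(scores) and all(score == 0 for score in scores)


print(phred_scores("I5!"))                    # Phred scores for 'I', '5' and '!'
print(has_placeholder_qualities("!!!!!!!!"))  # quality line made only of '!'
```

If the second check is true for the reads, any filter based on average base quality has nothing meaningful to work with.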

ADD REPLY • link 6.4 years ago by jean.elbers ★ 1.7k

0

Dear jean.elbers,

Sorry for the late reply, and thank you for the useful information. The miniasm-racon-pilon pipeline looks good. I will give it a try.

ADD REPLY • link 6.4 years ago by freddiejung ▴ 60

0

One of the first steps in your pipeline could be LoRDEC.

ADD REPLY • link 6.4 years ago by erwan.scaon ▴ 950

0

Dear erwan.scaon,

Thank you for the information. Is it better to correct errors in the long reads BEFORE assembly rather than AFTER assembly?

ADD REPLY • link 6.4 years ago by freddiejung ▴ 60

0

Since you will use the PacBio reads to build the assembly, it is better to correct them first using the short reads (which have a quite low error rate) and run the assembly afterwards.
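For reference, correcting the long reads with LoRDEC before assembly might look like the sketch below. The -k and -s values are only illustrative starting points, not tuned recommendations for this genome, and the file names are placeholders.

```python
import shlex


def lordec_command(long_reads, short_reads, output, kmer=19, solid_threshold=3):
    """Build a lordec-correct command line (returned as a string, not executed)."""
    args = [
        "lordec-correct",
        "-i", long_reads,            # long reads to be corrected
        "-2", short_reads,           # short reads used to build the k-mer graph
        "-k", str(kmer),             # k-mer size (illustrative value)
        "-s", str(solid_threshold),  # minimum k-mer abundance (illustrative value)
        "-o", output,                # corrected long reads
    ]
    return shlex.join(args)


print(lordec_command("pacbio.fasta", "illumina.fastq", "pacbio_corrected.fasta"))
```

The corrected reads would then be passed to the assembler in place of the raw subreads.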

ADD REPLY • link 6.4 years ago by kolchenko • 0

score 1 · Answer 1 · 2018-03-14

1

6.4 years ago

Carambakaracho ★ 3.3k

Hi freddiejung,

I don't know the ALPACA pipeline, but it seems to be built on the (unmaintained) Celera assembler, a short read assembler. Such "hybrid" assembly strategies typically use the Illumina PE library for contig assembly and the PacBio reads for subsequent scaffolding. This is a good fit for low-coverage PacBio data, but I believe your 50x PacBio coverage gives you more options.

For true hybrid assembly, my current favorite is MaSuRCA, which has produced some impressive assemblies even on highly repetitive plant genomes and has performed really well in every case where I used it.

Based on your coverage, alternative strategies might involve Canu, a Celera fork for PacBio/Nanopore reads, or PacBio's HGAP.4 pipeline in their SMRT Analysis software. In both cases I recommend using your Illumina reads to polish the assembly with software like Pilon.

The answer to your second question depends on the assembler, but most often you can use the unfiltered subreads.

Cheers

ADD COMMENT • link 6.4 years ago by Carambakaracho ★ 3.3k

0

Dear Carambakaracho, thank you for the detailed information.

Actually, I ran HGAP4 using only the PacBio long reads. It gave me an assembly with N50 = 22 Mb, which was better than I expected.
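For readers unfamiliar with the metric, N50 is the contig length at which the contigs of that length or longer contain at least half of the total assembled bases. A minimal sketch of the calculation, using toy contig lengths rather than real data:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # total 32, so half is 16
print(n50(toy_lengths))  # 8 + 8 reaches 16, so this prints 8
```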

Now I'm very excited to see how much the assembly will improve with polishing software like Pilon!

ADD REPLY • link 6.4 years ago by freddiejung ▴ 60

0

Don't be surprised if the N50 doesn't change. In my experience, the vast majority of improvements will be corrections in homopolymer stretches, plus a few improvements where PacBio coverage was low. 22 Mb already sounds quite good. If you want to experiment some more, I can only recommend MaSuRCA: I impressed a few colleagues by generating really good assemblies from a multiplexed low-coverage PacBio run combined with older Illumina libraries.
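To see what kind of sites are meant, homopolymer runs in a contig can be listed with a few lines of code. This is only an illustration: the default minimum run length of 5 is an arbitrary choice, and comparing the runs before and after polishing is just one possible way to see where the sequence changed.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_length=5):
    """Return (start, base, length) for each run of identical bases >= min_length."""
    runs = []
    position = 0
    for base, group in groupby(sequence.upper()):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


toy_contig = "ACG" + "T" * 6 + "G" + "A" * 4 + "C"
print(homopolymer_runs(toy_contig, min_length=4))
```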