Pipeline for hybrid genome assembly of long and short reads AND how to deal with PacBio Sequel data
4.9 years ago
freddiejung ▴ 60

Dear all,

I am involved in the genome project of an insect species. The genome size of our species is estimated to be 500 Mb, and I have 100x Illumina short-read data and 50x PacBio Sequel long-read data.

I have two questions:

1. I’m going to use a hybrid assembly strategy. I thought the ALPACA pipeline (https://github.com/VicugnaPacos/ALPACA) would suit our situation.

However, I realized that ALPACA runs ALLPATHS-LG internally, and we don’t have the fragment library that ALLPATHS-LG needs.

Is there any better alternative pipeline?

2. I am totally new to PacBio data. Can I use the subreads.bam files directly for assembly, or do I have to run quality control steps first?


With 50x PacBio Sequel data, you could try miniasm (https://github.com/lh3/miniasm), then use racon (https://github.com/isovic/racon) for consensus, and then perhaps pilon (https://github.com/broadinstitute/pilon) for error correction with the Illumina short reads mapped to the racon assembly.

As far as I know, you should be able to use the subreads directly. Just note that all subread bases are given a Phred quality score of 0, so racon should be run with the flag -q -1 in this scenario.
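A rough sketch of the pipeline above, written as a small Python helper that only builds the command lines. The file names, thread count, and `pilon.jar` path are placeholders made up for illustration. The `samtools fastq` conversion of subreads.bam, the minimap2 presets (`ava-pb`, `map-pb`), and the bwa mapping step are assumptions about the usual way to feed these tools, so check each tool's README before running anything.

```python
import shlex


def gfa_to_fasta(gfa_text):
    """Turn the segment ('S') lines of a GFA file into FASTA records."""
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "S":
            records.append(f">{fields[1]}\n{fields[2]}")
    return "\n".join(records) + ("\n" if records else "")


def assembly_commands(subreads_bam, threads=8):
    """Commands up to the raw miniasm layout (written to miniasm.gfa)."""
    bam = shlex.quote(subreads_bam)
    return [
        # Assumed conversion step: unaligned subreads BAM -> FASTQ.
        f"samtools fastq {bam} > subreads.fq",
        # All-vs-all overlaps of the long reads, then the miniasm layout.
        f"minimap2 -x ava-pb -t {threads} subreads.fq subreads.fq > overlaps.paf",
        "miniasm -f subreads.fq overlaps.paf > miniasm.gfa",
    ]


def polishing_commands(r1, r2, threads=8, pilon_jar="pilon.jar"):
    """Commands for racon consensus and pilon polishing of miniasm.fa.

    miniasm.fa is expected to come from gfa_to_fasta() applied to miniasm.gfa.
    """
    r1, r2, jar = shlex.quote(r1), shlex.quote(r2), shlex.quote(pilon_jar)
    return [
        # Map the long reads back to the draft, then build a consensus.
        f"minimap2 -x map-pb -t {threads} miniasm.fa subreads.fq > reads_vs_draft.paf",
        # -q -1 as suggested above, because Sequel base qualities are all 0.
        f"racon -t {threads} -q -1 subreads.fq reads_vs_draft.paf miniasm.fa > racon.fa",
        # Map the Illumina pairs to the racon consensus for pilon.
        "bwa index racon.fa",
        f"bwa mem -t {threads} racon.fa {r1} {r2} | samtools sort -o illumina.bam -",
        "samtools index illumina.bam",
        f"java -jar {jar} --genome racon.fa --frags illumina.bam --output pilon",
    ]


if __name__ == "__main__":
    for cmd in assembly_commands("subreads.bam"):
        print(cmd)
    for cmd in polishing_commands("illumina_R1.fq", "illumina_R2.fq"):
        print(cmd)
```

Between the two lists, the output of `gfa_to_fasta` on the contents of miniasm.gfa would be written to miniasm.fa. Each command string could then be run with `subprocess.run(cmd, shell=True, check=True)`, but printing them first and running them by hand is probably safer for a first attempt.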


Dear jean.elbers,

Sorry for the late reply, and thank you for the useful information. The miniasm-racon-pilon pipeline looks good. I will give it a try.


One of the first steps in your pipeline could be LoRDEC.


Dear erwan.scaon,

Thank you for the information. Is it better to correct errors in the long reads BEFORE assembly rather than AFTER assembly?


Since you will use the PacBio reads to build the assembly, it is better to correct them first using the short reads (which have a much lower error rate) and run the assembly afterwards.
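For reference, a LoRDEC correction step run before assembly might look roughly like the sketch below. The `-k 19` and `-s 3` values are taken from LoRDEC's own usage example as far as I recall, not tuned for this genome, and all file names are placeholders. LoRDEC reads FASTA/FASTQ, so the subreads.bam would need converting first.

```python
import shlex


def lordec_command(short_reads, long_reads, corrected_out, k=19, solid=3):
    """Build a lordec-correct command line (a sketch, not a tuned setup).

    k     : k-mer length for the short-read de Bruijn graph (-k)
    solid : minimum count for a k-mer to be treated as solid (-s)
    """
    args = [
        "lordec-correct",
        "-2", short_reads,    # Illumina reads that supply the trusted k-mers
        "-k", str(k),
        "-s", str(solid),
        "-i", long_reads,     # PacBio reads to correct
        "-o", corrected_out,  # corrected long reads
    ]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    print(lordec_command("illumina.fq", "subreads.fq", "subreads.lordec.fa"))
```

The corrected reads would then go into the assembler in place of the raw subreads.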

4.9 years ago
Carambakaracho ★ 3.1k

Hi freddiejung,

I don't know the ALPACA pipeline, but it seems to be built on the (unmaintained) Celera assembler, a short read assembler. Such "hybrid" assembly strategies typically use the Illumina PE library for contig assembly and the PacBio reads for subsequent scaffolding. This is a good fit for low-coverage PacBio data, but I believe your 50x PacBio coverage gives you more options.

For true hybrid assembly, my current favorite is MaSuRCA, which has produced some impressive assemblies even on highly repetitive plant genomes and has performed really well in every case where I used it.

Based on your coverage, alternative strategies might involve Canu, a Celera fork for PacBio/Nanopore reads, or PacBio's HGAP.4 pipeline in their SMRT Analysis software. In both cases I recommend using your Illumina reads to polish the assembly with software like Pilon.

The answer to your second question depends on the assembler, but most often you can use the unfiltered subreads.

Cheers


Dear Carambakaracho, thank you for the detailed information.

Actually, I ran HGAP4 using only the PacBio long reads. It gave me an assembly with N50 = 22 Mb, which was better than I expected.

Now I'm very excited to see how much the assembly will be improved by polishing software like Pilon!
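Since N50 is the number being compared here, a minimal sketch of how it can be computed from contig lengths, which is handy for checking the assembly before and after polishing. The function names and the FASTA file name are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def fasta_lengths(lines):
    """Collect sequence lengths from an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


if __name__ == "__main__":
    # Toy example: total 290, half 145, reached at the second contig.
    print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

For a real assembly, something like `with open("assembly.fasta") as fh: print(n50(fasta_lengths(fh)))` would give the value to compare across assembler or polishing runs.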


Don't be surprised if the N50 doesn't change. In my experience, the vast majority of improvements will be corrections in homopolymer stretches, plus a few improvements where PacBio coverage was low. 22 Mb sounds quite good already. In case you want to experiment some more, I can only recommend MaSuRCA; I impressed a few colleagues by generating really good assemblies from a multiplexed low-coverage PacBio run combined with older Illumina libraries.


Hi, how long did the HGAP4 assembly take, and what was the size of your genome?