Question: Pipeline for hybrid genome assembly of long and short reads AND how to deal with PacBio Sequel data
9 months ago by
freddiejung20
Japan
freddiejung20 wrote:

Dear all,

I am involved in the genome project of an insect species. The genome size of our species is estimated to be 500 Mb, and I have 100x Illumina short-read data and 50x PacBio Sequel long-read data.

I have two questions:

  1. I’m going to take a hybrid assembly strategy. I thought the ALPACA pipeline (https://github.com/VicugnaPacos/ALPACA) would suit our situation.

However, I realized that ALPACA uses ALLPATHS-LG internally, and we don’t have the fragment library that ALLPATHS-LG requires.

Is there a better alternative pipeline?

  2. I am totally new to PacBio data. Can I use the subreads.bam files directly for assembly, or do I have to run quality control steps first?

Any comments would be appreciated.

modified 9 months ago by Carambakaracho640 • written 9 months ago by freddiejung20

With 50x PacBio Sequel data, you could try miniasm (https://github.com/lh3/miniasm) for the assembly, then racon (https://github.com/isovic/racon) for consensus, and then perhaps pilon (https://github.com/broadinstitute/pilon) for error correction with the Illumina short reads mapped to the racon assembly.

As far as I know, you should be able to use the subreads directly. Just note that all Sequel subread bases are given a Phred quality score of 0; in this scenario, run racon with the flag -q -1.
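For anyone who wants to try this, here is a minimal Python sketch of the miniasm-racon-pilon steps described above. It assumes the subreads have already been exported to FASTQ and that minimap2, miniasm, racon, bwa, samtools, and a Pilon jar are installed. All file names, the thread count, and the helper names (Step, assembly_steps, polish_steps, gfa_to_fasta, run) are placeholders made up for illustration. minimap2 is not mentioned above; it is assumed here because miniasm and racon both need overlaps or mappings as input.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    argv: List[str]               # command and its arguments
    stdout: Optional[str] = None  # file that captures stdout, if any


def assembly_steps(reads: str, threads: int = 8) -> List[Step]:
    """Overlap and layout: minimap2 all-vs-all, then miniasm."""
    t = str(threads)
    return [
        Step(["minimap2", "-x", "ava-pb", "-t", t, reads, reads], "overlaps.paf"),
        Step(["miniasm", "-f", reads, "overlaps.paf"], "draft.gfa"),
    ]


def gfa_to_fasta(gfa_path: str, fasta_path: str) -> int:
    """Write the segment (S) lines of a GFA file as FASTA; return how many."""
    count = 0
    with open(gfa_path) as gfa, open(fasta_path, "w") as fasta:
        for line in gfa:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "S" and len(fields) >= 3:
                fasta.write(f">{fields[1]}\n{fields[2]}\n")
                count += 1
    return count


def polish_steps(reads: str, r1: str, r2: str, draft: str = "draft.fa",
                 threads: int = 8, pilon_jar: str = "pilon.jar") -> List[Step]:
    """One racon round with the long reads, then one Pilon round with Illumina."""
    t = str(threads)
    return [
        Step(["minimap2", "-x", "map-pb", "-t", t, draft, reads], "reads_vs_draft.paf"),
        # -q -1 as suggested above, because Sequel subreads carry no real qualities
        Step(["racon", "-t", t, "-q", "-1", reads, "reads_vs_draft.paf", draft], "racon.fa"),
        Step(["bwa", "index", "racon.fa"]),
        Step(["bwa", "mem", "-t", t, "racon.fa", r1, r2], "illumina.sam"),
        Step(["samtools", "sort", "-@", t, "-o", "illumina.bam", "illumina.sam"]),
        Step(["samtools", "index", "illumina.bam"]),
        Step(["java", "-jar", pilon_jar, "--genome", "racon.fa",
              "--frags", "illumina.bam", "--output", "pilon"]),
    ]


def run(steps: List[Step]) -> None:
    """Run each step in order, redirecting stdout to a file where requested."""
    for step in steps:
        if step.stdout is None:
            subprocess.run(step.argv, check=True)
        else:
            with open(step.stdout, "w") as out:
                subprocess.run(step.argv, check=True, stdout=out)
```

Typical use would be run(assembly_steps("subreads.fq")), then gfa_to_fasta("draft.gfa", "draft.fa"), then run(polish_steps("subreads.fq", "R1.fq", "R2.fq")). Racon and Pilon rounds are often repeated, which is not shown here.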

modified 9 months ago • written 9 months ago by jean.elbers470

Dear jean.elbers,

Sorry for the late reply, and thank you for the useful information. The miniasm-racon-pilon pipeline looks good. I will give it a try.

written 9 months ago by freddiejung20

One of the first steps in your pipeline could be error correction of the long reads with LoRDEC.

written 9 months ago by erwan.scaon590

Dear erwan.scaon,

Thank you for the information. Is it better to correct errors in the long reads BEFORE assembly rather than AFTER assembly?

written 9 months ago by freddiejung20

Since you will use the PacBio reads to build the assembly, it is better to correct them first using the short reads (which have a much lower error rate) and run the assembly afterwards.

written 9 months ago by kolchenko0
9 months ago by
Switzerland
Carambakaracho640 wrote:

Hi freddiejung,

I don't know the ALPACA pipeline, but it seems to be built on the (unmaintained) Celera assembler, a short read assembler. Such "hybrid" assembly strategies typically use the Illumina PE library for contig assembly and the PacBio reads for subsequent scaffolding. This is a good fit for low-coverage PacBio data, but I believe your 50x PacBio coverage gives you more options.

For true hybrid assembly, my current favorite is MaSuRCA, which has produced some impressive assemblies even on highly repetitive plant genomes and has performed really well in every case where I have used it.

Based on your coverage, alternative strategies might involve Canu, a Celera fork for PacBio/Nanopore reads, or PacBio's HGAP4 pipeline in their SMRT Analysis software. In both cases I recommend using your Illumina reads to polish the assembly with software like Pilon.
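As a rough illustration of the Canu route, the sketch below builds a Canu command for a 500 Mb genome. It assumes a Canu 1.x release, where raw PacBio reads are passed with -pacbio-raw; newer releases spell the read-type option differently, so check canu --help for your version. The function name, prefix, output directory, and read file name are placeholders made up for this example.

```python
import subprocess
from typing import List


def canu_command(reads: str, genome_size: str = "500m",
                 prefix: str = "insect", out_dir: str = "canu_out") -> List[str]:
    """Build a Canu 1.x command line for raw PacBio reads (names are placeholders)."""
    return [
        "canu",
        "-p", prefix,                 # prefix for output file names
        "-d", out_dir,                # working/output directory
        f"genomeSize={genome_size}",  # estimated genome size, here 500 Mb
        "-pacbio-raw", reads,         # uncorrected PacBio reads (Canu 1.x option)
    ]


def run_canu(reads: str) -> None:
    """Run Canu and raise if it exits with an error."""
    subprocess.run(canu_command(reads), check=True)
```

The resulting contigs would then be polished with the Illumina reads, as recommended above.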

The answer to your second question depends on the assembler, but most often you can use the unfiltered subreads.

Cheers

written 9 months ago by Carambakaracho640

Dear Carambakaracho, thank you for the detailed information.

Actually, I have already run HGAP4 using only the PacBio long reads. It gave me an assembly with N50 = 22 Mb, which was better than I expected.

Now I'm very excited to see how much the assembly will be improved by polishing software like Pilon!

written 9 months ago by freddiejung20

Don't be surprised if the N50 doesn't change. In my experience, the vast majority of improvements will be corrections in homopolymer stretches, plus a few improvements where PacBio coverage was low. 22 Mb sounds quite good already. In case you want to experiment some more, I can only recommend MaSuRCA. I impressed a few colleagues by generating really good assemblies from a multiplexed low-coverage PacBio run combined with older Illumina libraries.
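To see why polishing barely moves the N50, here is a small Python sketch. N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. The contig lengths below are made-up toy numbers, not real results.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length of the contig at which half of the total assembly size is reached."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no contigs given


# Toy example: polishing fixes small indels, so contig lengths shift by a few
# bases while the number of contigs and their order stay the same.
before = [22_000_000, 10_000_000, 5_000_000]
after = [22_000_150, 9_999_900, 5_000_020]
print(n50(before), n50(after))
```

Polishing does not join or break contigs, so the N50 only shifts by the handful of bases that were inserted or deleted.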

written 9 months ago by Carambakaracho640