I am involved in the genome project of an insect species. The genome size of our species is estimated to be 500 Mb, and I have 100x Illumina short-read data and 50x PacBio Sequel long-read data.
I have two questions:
- I’m going to take a hybrid assembly strategy. I thought the ALPACA pipeline (https://github.com/VicugnaPacos/ALPACA) would suit our situation.
However, I then realized that ALPACA uses ALLPATHS-LG internally, and we don’t have the fragment library that ALLPATHS-LG requires.
Is there a better alternative pipeline?
- I am totally new to PacBio data. Can I use the subreads.bam files directly for assembly, or do I have to run quality control steps first?
Any comments would be appreciated.
With 50x PacBio Sequel data, you could try miniasm (https://github.com/lh3/miniasm) for the assembly, then racon (https://github.com/isovic/racon) for consensus, and then perhaps pilon (https://github.com/broadinstitute/pilon) for error correction with the Illumina short reads mapped to the racon assembly.
As far as I know, you should be able to use the subreads directly. Just note that all subread bases are given a Phred quality score of 0, but racon uses the flag -q -1 in this scenario.
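To make the order of steps concrete, here is a rough sketch in Python that only builds the shell commands for a miniasm, racon, pilon run; it does not execute anything. Several things here are assumptions and not part of the suggestion above: minimap2 is used for the overlaps and for mapping the long reads, bwa and samtools are used to map the Illumina reads, pilon is assumed to be available as a `pilon` wrapper script (otherwise `java -jar pilon.jar`), the subreads are assumed to have been exported to a FASTA/FASTQ file already, and all file names and the `racon_opts` parameter are made up for illustration.

```python
import shlex


def gfa_to_fasta(gfa_text):
    """Turn GFA segment (S) lines into FASTA text.

    miniasm writes its unitigs as GFA, while racon takes a FASTA/FASTQ target.
    """
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "S":
            records.append(">" + fields[1] + "\n" + fields[2] + "\n")
    return "".join(records)


def build_commands(long_reads, r1, r2, prefix="asm", threads=8, racon_opts=""):
    """Return the shell commands of the sketched pipeline, in order.

    All file names are hypothetical. Between steps 2 and 3 the GFA file
    has to be converted to <prefix>.raw.fa, e.g. with gfa_to_fasta().
    """
    q = shlex.quote
    lr = q(long_reads)
    ovl, gfa = q(prefix + ".ovl.paf"), q(prefix + ".gfa")
    raw, paf = q(prefix + ".raw.fa"), q(prefix + ".map.paf")
    racon_fa, bam = q(prefix + ".racon.fa"), q(prefix + ".illumina.bam")
    extra = racon_opts + " " if racon_opts else ""
    return [
        # 1. all-vs-all overlaps of the PacBio reads (minimap2 is an assumption)
        f"minimap2 -x ava-pb -t {threads} {lr} {lr} > {ovl}",
        # 2. assembly layout with miniasm (output is GFA, no consensus yet)
        f"miniasm -f {lr} {ovl} > {gfa}",
        # 3. map the long reads back to the raw unitigs
        f"minimap2 -x map-pb -t {threads} {raw} {lr} > {paf}",
        # 4. consensus with racon; extra options such as "-q -1" go in racon_opts
        f"racon -t {threads} {extra}{lr} {paf} {raw} > {racon_fa}",
        # 5-7. map the Illumina pairs to the racon assembly (bwa/samtools assumed)
        f"bwa index {racon_fa}",
        f"bwa mem -t {threads} {racon_fa} {q(r1)} {q(r2)}"
        f" | samtools sort -@ {threads} -o {bam} -",
        f"samtools index {bam}",
        # 8. short-read polishing with pilon
        f"pilon --genome {racon_fa} --frags {bam} --output {q(prefix + '.pilon')}",
    ]


if __name__ == "__main__":
    for cmd in build_commands("subreads.fq", "R1.fq", "R2.fq", racon_opts="-q -1"):
        print(cmd)
```

Printing the commands first makes it easy to check paths and options before running them one by one. In practice the racon and pilon steps are often repeated for more than one round, but how many rounds help is something to check on your own data.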
Sorry for the late reply, and thank you for the useful information. The miniasm-racon-pilon pipeline looks good. I will give it a try.
One of the first steps in your pipeline could be LoRDEC.
Thank you for the information. Is it better to correct errors in the long reads BEFORE assembly rather than AFTER assembly?
Since you will use the PacBio reads to build the assembly, it is better to correct them first using the short reads (which have a much lower error rate) and run the assembly afterwards.
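As an illustration of what that pre-assembly correction step might look like, the sketch below builds a `lordec-correct` command line in Python without running it. The file names are hypothetical, the long reads are assumed to be available as FASTA/FASTQ, and the k-mer size (19) and solidity threshold (3) are only illustrative starting values, not settings tuned for a 500 Mb insect genome. Passing several short-read files as one comma-separated `-2` argument is my understanding of the LoRDEC usage, so check it against `lordec-correct` help output for your version.

```python
import shlex


def lordec_command(short_reads, long_reads, out_fasta, k=19, solid=3):
    """Build a lordec-correct command line (nothing is executed here).

    short_reads: list of Illumina read files, joined with commas for -2
                 (assumed to be how LoRDEC takes multiple files).
    k, solid:    k-mer length and solid k-mer abundance threshold;
                 illustrative values that need tuning.
    """
    short = ",".join(short_reads)
    return " ".join([
        "lordec-correct",
        "-2", shlex.quote(short),      # short reads used to build the de Bruijn graph
        "-k", str(k),                  # k-mer length
        "-s", str(solid),              # minimum abundance for a k-mer to count as solid
        "-i", shlex.quote(long_reads), # long reads to correct
        "-o", shlex.quote(out_fasta),  # corrected long reads (FASTA)
    ])


if __name__ == "__main__":
    print(lordec_command(["R1.fq", "R2.fq"], "subreads.fasta", "corrected.fasta"))
```

The corrected reads would then replace the raw subreads as input to the assembler.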