The question to ask is what type of variation you expect to encounter. That is not a systematic approach, to be sure, but there is no one-size-fits-all solution. In my personal experience (TruSeq/Nextera Illumina with several paired-end types, de novo and reference-based; PacBio RSII p4c2/p5c3/p6c4 de novo; Nanopore R7/R9.4 de novo), I would go with PacBio (>40x coverage, ideally >80x; 15-20kb fragment size) combined with Illumina (10-15x coverage). PacBio alone will work for SVs so long as you are not specifically calling SNPs, as homopolymer frameshifts and read accuracy are potential confounders (not insurmountable, just something to be aware of).
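To put those coverage targets into numbers for a sequencing quote, here is a minimal sketch of the usual back-of-envelope calculation (coverage = total sequenced bases / genome size). The 500 Mb genome size, the 150 bp Illumina read length, and the use of the fragment size as the mean PacBio read length are all assumptions for illustration, not figures from your project.

```python
import math


def reads_needed(genome_size_bp: int, target_coverage: int, mean_read_len_bp: int) -> int:
    """Reads required so that (reads * mean read length) / genome size >= target coverage."""
    total_bases = genome_size_bp * target_coverage
    return math.ceil(total_bases / mean_read_len_bp)


# Hypothetical 500 Mb genome (assumption; substitute your own estimate).
GENOME_SIZE_BP = 500_000_000

# PacBio at 80x, assuming a mean read length of 15 kb (assumption).
pacbio_reads = reads_needed(GENOME_SIZE_BP, 80, 15_000)

# Illumina at 15x, assuming 150 bp reads (assumption; count each mate separately).
illumina_reads = reads_needed(GENOME_SIZE_BP, 15, 150)

print(f"PacBio reads needed:   {pacbio_reads:,}")
print(f"Illumina reads needed: {illumina_reads:,}")
```

Real mean read lengths are usually shorter than the library fragment size, so treat the PacBio number as a lower bound on the reads you need.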
As Roxane said, there are numerous advantages to long reads for structural variation (also see this paper). That said, they are noisy reads, so 10x depth will not be enough for majority-consensus SNP basecalls, hence the typical hybrid sequencing route. By comparison, Illumina cannot call the same types of SVs without additional technologies (e.g. combining with mate-pair runs, as in human genomics), and even then I am personally skeptical of its ability to call anything other than large-scale insertions, deletions and inversions. Tandem insertions of multiple copies of a gene (as in the ultra-long nanopore read paper above) are the kind of variant I would expect it to miss, and they could be relevant to an allopolyploid with differential gene copy number in said pathway.
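To get a feel for why depth matters so much with noisy reads, here is a rough sketch using a simple binomial model: each read reports the true base independently with some per-read accuracy, and the consensus is correct when a strict majority of reads agree with the truth. The 0.85 accuracy figure is an illustrative assumption, and the model is deliberately crude (it ignores indels, systematic errors, and the fact that a heterozygous allele only gets a fraction of the depth).

```python
from math import comb


def majority_correct_prob(depth: int, per_read_accuracy: float) -> float:
    """P(strict majority of `depth` reads show the true base), assuming independent errors."""
    p = per_read_accuracy
    need = depth // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(need, depth + 1)
    )


# Assumed raw per-read accuracy for noisy long reads (illustrative only).
ACCURACY = 0.85

for depth in (10, 20, 40, 80):
    p_ok = majority_correct_prob(depth, ACCURACY)
    # Expected miscalled sites per million positions under this toy model.
    print(f"{depth:>3}x: expected consensus errors per Mb ~ {(1 - p_ok) * 1e6:.3g}")
```

Under these assumptions the per-site error at 10x is small but not negligible once you multiply it across a whole genome, whereas at 80x it effectively vanishes, which is the intuition behind pairing long reads with Illumina at low long-read depth.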
With PacBio, you can choose your fragment size to overcome particular obstacles. There was a recent PR blurb from PacBio for Sequel 6.0, basically saying that their circular consensus sequencing (CCS) reads could achieve >99% accuracy at high throughput. The drawback is that this applies to 1kb and 2kb fragments, which I imagine is the result of five or more passes over the target sequence in each CCS read. With the RSII, even without CCS reads, the majority of basecalling errors we saw were stochastic, so higher depth of coverage (>80x) was sufficient for consensus even with homopolymers of 3-6 bases, something that typically stymied 454 sequencing. Longer homopolymers will still give you frameshift trouble, though I haven't tried the Sequel yet, so I don't know if they have gotten better at resolving those.
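The fragment-size trade-off for CCS can be sketched with simple arithmetic: the polymerase keeps going around the circular template, so the number of full passes is roughly the polymerase read length divided by the insert length. The 10 kb polymerase read length below is a hypothetical figure for illustration, the five-pass threshold is my guess from above rather than a PacBio specification, and adapter sequence is ignored.

```python
def full_passes(polymerase_read_len_bp: int, insert_len_bp: int) -> int:
    """Approximate number of complete passes over the insert (adapters ignored)."""
    return polymerase_read_len_bp // insert_len_bp


def max_insert_for_passes(polymerase_read_len_bp: int, min_passes: int) -> int:
    """Largest insert that still gets at least `min_passes` complete passes."""
    return polymerase_read_len_bp // min_passes


# Hypothetical polymerase read length (assumption, not a vendor figure).
POLYMERASE_READ_LEN_BP = 10_000

for insert in (1_000, 2_000, 15_000):
    print(f"{insert:>6} bp insert: ~{full_passes(POLYMERASE_READ_LEN_BP, insert)} full passes")

print("Max insert for 5 passes:", max_insert_for_passes(POLYMERASE_READ_LEN_BP, 5), "bp")
```

With these numbers a 15kb insert gets no complete pass at all, which is why the long library is used for scaffolding and the short one for accurate consensus.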
The last time I spoke with a PacBio rep, their suggestion was to run a 2kb library for CCS and a 15-20kb library for scaffolding (taken with a grain of 'please buy my product' salt). If you have cheap Illumina available to you, however, that is still the better option for checking individual base accuracy and homopolymers, although Illumina itself is not immune to that issue. In our experience with bacterial genomes, PacBio 15kb long reads at 80x were sufficient that we no longer needed the Illumina data for checking. Given eukaryotic genome complexity and the size of typical plant genomes, the hybrid approach is still your best bet. However, you should probably scale up the PacBio long-read coverage.