Illumina polishing after PacBio assembly and polishing?
4.9 years ago
ikangkim ▴ 50

Hi,

If genome sequences were assembled from PacBio reads and then polished iteratively with Quiver/Arrow, do I still need to polish them further with Illumina reads?

I've been trying to assemble a few dozen actinobacterial genomes. I have both PacBio and Illumina data, and sequencing depth is >100X for both data types in every genome.

I performed hybrid assembly using Unicycler and SPAdes, and long-read assembly using Canu and Flye, for all genomes. Then I compared the assembly results. I found that the long-read assemblies were more contiguous than the hybrid assemblies for many genomes. So I chose the long-read assemblies and polished them with the Arrow algorithm from the PacBio GenomicConsensus package (https://github.com/PacificBiosciences/GenomicConsensus). I could get stable(?) genome sequences after 1-3 rounds of Arrow polishing. In other words, Arrow didn't correct anything further after 1-3 rounds.

But when I polished these stable(?) genomes with Illumina reads using Pilon, Pilon introduced many changes, sometimes including insertions/deletions of >50-100 bp.

Should I believe the Pilon results in these cases? Or, am I doing unnecessary Pilon polishing?

Thanks.

Assembly genome sequencing • 3.8k views
4.9 years ago
h.mon 35k

One of the main problems with PacBio (and Nanopore, currently to an even greater extent) is that it has a systematic deletion error at homopolymer regions. As these errors are systematic (meaning they tend to occur at the same regions), polishing with PacBio reads won't fix all of them. Illumina, on the other hand, has a different error mode (mainly random substitutions) and can successfully correct PacBio indels.

Regarding the longer insertions / deletions, I would either believe Pilon, or check the mappings at these regions with IGV or some other genome browser.
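
To decide which regions to open in IGV, it can help to pull only the larger edits out of Pilon's report. The sketch below assumes Pilon was run with `--changes` and that each line of the resulting `.changes` file has four whitespace-separated fields (original coordinate, new coordinate, original sequence, new sequence, with `.` for an empty sequence). The function name, the 20 bp threshold, and the example lines are made up for illustration.

```python
from typing import Iterable, List, Tuple


def large_changes(lines: Iterable[str], min_len: int = 20) -> List[Tuple[str, int, int]]:
    """Return (original_coordinate, deleted_bp, inserted_bp) for large edits.

    Assumes the four-column layout of a Pilon .changes file; lines that do
    not have exactly four fields are skipped.
    """
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # blank or unexpected line
        coord, _new_coord, old_seq, new_seq = fields
        deleted = 0 if old_seq == "." else len(old_seq)
        inserted = 0 if new_seq == "." else len(new_seq)
        # Net length change of the edit; substitutions give 0.
        if abs(deleted - inserted) >= min_len:
            hits.append((coord, deleted, inserted))
    return hits


# Hypothetical example lines: one SNP and one 43 bp deletion.
example = [
    "contig_1:1520 contig_1_pilon:1520 A G",
    "contig_1:88210-88252 contig_1_pilon:88210 " + "ACGT" * 10 + "ACG" + " .",
]
for coord, deleted, inserted in large_changes(example):
    print(coord, deleted, inserted)
```

The coordinates in the first column can be pasted into the IGV locus box to inspect the PacBio and Illumina mappings at each reported region.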


Thank you for the comment.

(1) I didn't know that PacBio has systematic deletion at homopolymer regions.

(2) Below is an IGV snapshot of the region where Pilon introduced a 43 bp deletion. The upper track is the PacBio read mapping and the lower track is the Illumina read mapping. All PacBio reads mapped to this region with a ~40-45 bp insertion, and some of the Illumina reads also showed an insertion. PacBio coverage was even across this region, while the Illumina mapping showed a noticeable peak. In my opinion, Pilon might have introduced an erroneous long deletion at this region, possibly due to incorrect mapping. Am I right?

IGV snapshot


(1) I didn't know that PacBio has systematic deletion at homopolymer regions.

Actually, as I understand it, the errors for PacBio CLR tend to be random rather than systematic, but the error rates are higher than ONT's and are primarily indels. Newer R10 data from Oxford Nanopore seem to largely address the systematic errors in their data, though I suspect this may still be an issue with longer homopolymers (8-10 bp or longer).

(2) Below is an IGV snapshot of the region where Pilon introduced a 43 bp deletion. The upper track is the PacBio read mapping and the lower track is the Illumina read mapping. All PacBio reads mapped to this region with a ~40-45 bp insertion, and some of the Illumina reads also showed an insertion. PacBio coverage was even across this region, while the Illumina mapping showed a noticeable peak. In my opinion, Pilon might have introduced an erroneous long deletion at this region, possibly due to incorrect mapping. Am I right?

It can sometimes do this, yes, if you run Pilon in its default mode or allow all of its error-correction steps, such as local reassembly. However, this may also be sample-dependent: the region you show here where Illumina coverage is higher is actually quite small (hard to read, but maybe 10-12 bp?), but it isn't exactly double the depth, as I would expect if this were a deletion in the entire population. I have seen instances like this in microbial assemblies that appear to be 'hot spots' for indels, which I suspect may be present in only a subset of the bacteria in the sample.

I normally aim on the conservative side. I run Pilon iteratively, adjusted for ploidy, checking the number of corrections in the returned VCF, and I turn off any Pilon option that allows larger changes, trying to restrict correction to small indels and SNPs.
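
A rough way to track the number of corrections per round is to tally the VCF records by size. This is only a sketch: it assumes a tab-separated VCF in which unchanged positions have `.` in the ALT column, it looks at the first ALT allele only, and it ignores the FILTER column. The function name, the 10 bp cutoff for "large", and the example records are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_corrections(vcf_lines: Iterable[str], large: int = 10) -> Counter:
    """Tally VCF records as substitution, small_indel, or large_indel."""
    counts: Counter = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        ref, alt = fields[3], fields[4].split(",")[0]
        if alt in (".", ref) or alt.startswith("<"):
            continue  # no change, or a symbolic allele not handled here
        size = abs(len(ref) - len(alt))
        if size == 0:
            counts["substitution"] += 1
        elif size < large:
            counts["small_indel"] += 1
        else:
            counts["large_indel"] += 1
    return counts


# Hypothetical records: unchanged site, SNP, 1 bp deletion, 44 bp deletion.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig_1\t100\t.\tA\t.\t.\tPASS\tDP=120",
    "contig_1\t1520\t.\tA\tG\t.\tPASS\tDP=118",
    "contig_1\t2040\t.\tGC\tG\t.\tPASS\tDP=110",
    "contig_1\t88210\t.\t" + "A" + "CGTA" * 11 + "\tA\t.\tPASS\tDP=95",
]
print(dict(count_corrections(vcf)))
```

Comparing these tallies between rounds gives a simple stopping rule: stop iterating once the counts reach zero or stop decreasing, and look closely at any round that reports large indels.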


Thank you for the comment. I had never thought about 'hot spots' for indels.

Nowadays, I'm checking the Pilon results more carefully, and I sometimes run Pilon with the "--fix bases" option to disallow large changes.
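
For reference, a conservative Pilon call along these lines could be assembled as below. This is a sketch, not a tested pipeline: the file names, the jar path, the 16 GB heap size, and the thread count are placeholders. The flags (`--genome`, `--frags`, `--output`, `--fix`, `--changes`, `--vcf`, `--threads`) are standard Pilon options, and `bases` limits fixes to SNPs and small indels.

```python
import shlex
from typing import List


def pilon_command(jar: str, genome: str, bam: str, prefix: str,
                  fix: str = "bases", threads: int = 4) -> List[str]:
    """Build a Pilon command line restricted to the given --fix categories."""
    return [
        "java", "-Xmx16G", "-jar", jar,   # heap size is a placeholder
        "--genome", genome,               # assembly to polish (FASTA)
        "--frags", bam,                   # sorted, indexed paired-end Illumina BAM
        "--output", prefix,               # prefix for output files
        "--fix", fix,                     # "bases" = SNPs and small indels only
        "--changes",                      # write a .changes file listing edits
        "--vcf",                          # write a VCF of the evidence
        "--threads", str(threads),
    ]


# Hypothetical file names for a first polishing round.
cmd = pilon_command("pilon.jar", "arrow_polished.fasta",
                    "illumina.sorted.bam", "pilon_round1")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. Keeping `--changes` and `--vcf` on makes it easy to review what each round altered.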

Thanks.
