I've put the questions first and the background after them, in case of TL;DR.
Questions:
Is it generally best practice to run CircularConsensus on SMRT cell DNAseq data before doing an analysis such as scaffolding or genome assembly?
Under what circumstances would you not run CircularConsensus?
(I posted these same questions on seqanswers)
Background
Our collaborators sent us 9.4 Gbp (12 SMRT cells) of plant DNA sequence data from an RS II (P6C4, I think). We estimate this represents ~20x coverage of the plant's genome. All of the initial processing was done by the collaborators. Their last step was filtering the subreads, which have a post-filter N50 of around 8,000 bp.
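For anyone checking the numbers above, here is a minimal sketch of the arithmetic behind them. The function names are my own, and the N50 definition assumed here is the usual one: the length L such that reads of length L or longer contain at least half of all bases.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


def implied_genome_size(total_bases, coverage):
    """Genome size implied by total sequenced bases and an estimated coverage."""
    return total_bases / coverage


if __name__ == "__main__":
    # 9.4 Gbp at ~20x coverage implies a genome of roughly 470 Mbp.
    print(implied_genome_size(9.4e9, 20) / 1e6, "Mbp")
```

The read lengths themselves would come from the filtered subread files, which are not shown here.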
My Goal
I am trying to use the reads to gap-fill and do additional scaffolding of a draft genome assembly.
Results
I ran PBjelly on the uncorrected subreads, providing the separate Analysis_results directory for each SMRT cell. The results seem very good: about half of the gaps were filled and the scaffold N50 increased by 20%.
But I suspect that some of the filled gaps, especially those in repetitive regions, are not correct. When I looked at the subread placement over each gap (produced by PBjelly), I noticed that some gaps were filled by only a minority of the subreads from a ZMW, for example 2 of the 9 subreads. There were a few instances like this. It occurred to me that it may have been a mistake to use the subreads rather than consensus sequences.
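To make the "2 of 9 subreads" check concrete, here is a rough sketch of how the support per ZMW could be tallied. It assumes the standard PacBio subread naming, movie/holeNumber/start_end, and that the names of the subreads placed over a gap have already been pulled out of the PBjelly output (that extraction step is not shown). The function names and the 0.5 threshold are illustrative, not part of PBjelly.

```python
from collections import Counter


def zmw_of(subread_name):
    """Return 'movie/hole' from a name of the form movie/hole/start_end."""
    movie, hole, _span = subread_name.split("/")[:3]
    return f"{movie}/{hole}"


def zmw_support(gap_subreads, all_subreads):
    """Map each ZMW seen at a gap to (subreads used at gap, total subreads)."""
    total = Counter(zmw_of(name) for name in all_subreads)
    used = Counter(zmw_of(name) for name in gap_subreads)
    return {zmw: (used[zmw], total[zmw]) for zmw in used}


def minority_supported(support, min_fraction=0.5):
    """List ZMWs where fewer than min_fraction of their subreads fill the gap."""
    return sorted(
        zmw for zmw, (used, total) in support.items()
        if total and used / total < min_fraction
    )


if __name__ == "__main__":
    # Hypothetical names: 9 subreads from hole 42, 1 subread from hole 7.
    everything = [f"movie1/42/{i * 1000}_{i * 1000 + 900}" for i in range(9)]
    everything.append("movie1/7/0_8000")
    at_gap = everything[:2] + ["movie1/7/0_8000"]
    print(minority_supported(zmw_support(at_gap, everything)))
```

A gap whose fill rests mostly on ZMWs flagged this way would be one I'd want to inspect by hand.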