Question

Is there any reason not to run RS_ReadsOfInsert?

8.4 years ago

conrad.stack • 0

I've included some background info after the questions, which come first in case of TL;DR.

Questions:

Is it generally best practice to run CircularConsensus on SMRT cell DNA-seq data before doing an analysis such as scaffolding or genome assembly?
Under what circumstances would you not run CircularConsensus?

(I posted these same questions on seqanswers)

Background

Our collaborators sent us 9.4 Gbp (12 SMRT cells) of plant DNA sequencing from an RSII (P6C4, I think). We estimate this represents ~20x coverage of the plant's genome. All of the initial processing was done by the collaborators. Their last step was the filtering of subreads, which have a post-filter N50 of around 8,000 bp.
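As a quick sanity check on the numbers above, here is a minimal sketch of the coverage arithmetic and of how a subread N50 is computed. The function names are made up for illustration, and the genome size is only what the quoted yield and ~20x coverage imply, not a measured value.

```python
def implied_genome_size(total_bases, coverage):
    """Genome size implied by total sequenced bases and fold coverage."""
    return total_bases / coverage


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# 9.4 Gbp at ~20x coverage implies a genome of roughly 470 Mbp.
size = implied_genome_size(9.4e9, 20)
print(f"implied genome size: {size / 1e6:.0f} Mbp")
```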

My Goal

I am trying to use the reads to gap-fill and do additional scaffolding of a draft genome assembly.

Results

I ran PBjelly on the uncorrected subreads, providing the Analysis_results directory for each SMRT cell. The results seem very good: about half of the gaps were filled and the scaffold N50 increased by 20%.

But I suspect that some of the filled gaps, especially those in repetitive areas, are not correct. When I looked at the subread placement over each gap (produced by PBjelly), I noticed that some gaps were filled, for example, by a minor proportion (N=2) of the total subreads (N=9) from a ZMW. There were a few instances like this. It occurred to me that maybe it was a mistake to use the subreads rather than consensus sequences.
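One way to look for cases like the 2-of-9 example systematically is to group subreads by ZMW and compare how many support a gap fill with how many the ZMW produced in total. The sketch below assumes RS II style subread names of the form `movie/holeNumber/start_end`; the two input lists are hypothetical, and extracting the supporting subread names from the PBjelly output is not shown here.

```python
from collections import Counter


def zmw_id(subread_name):
    """Drop the trailing start_end field, leaving 'movie/holeNumber'."""
    return subread_name.rsplit("/", 1)[0]


def zmw_support(all_subreads, supporting_subreads):
    """Map each supporting ZMW to (supporting subreads, total subreads)."""
    total = Counter(zmw_id(name) for name in all_subreads)
    support = Counter(zmw_id(name) for name in supporting_subreads)
    return {zmw: (support[zmw], total[zmw]) for zmw in support}


def minority_supported(all_subreads, supporting_subreads, min_fraction=0.5):
    """ZMWs where fewer than min_fraction of the subreads support the fill.

    min_fraction is an arbitrary illustrative threshold, not a recommendation.
    ZMWs missing from all_subreads (total of 0) are skipped.
    """
    counts = zmw_support(all_subreads, supporting_subreads)
    return sorted(z for z, (s, t) in counts.items() if t and s / t < min_fraction)


# Hypothetical ZMW with 9 subreads, only 2 of which were placed over the gap.
all_names = [f"m1/7/{i * 100}_{i * 100 + 90}" for i in range(9)]
placed = all_names[:2]
print(zmw_support(all_names, placed))
print(minority_supported(all_names, placed))
```

Gaps whose support comes mostly from ZMWs on this list are reasonable candidates for manual inspection.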

genome pacbio ccs rsii DNA-seq • 2.1k views

ADD COMMENT • link updated 8.4 years ago by Charles Warden 8.3k • written 8.4 years ago by conrad.stack • 0

Cross-posted here.

ADD REPLY • link 8.4 years ago by Brian Bushnell 20k

score 1 · Answer 1 · 2017-05-03

It depends on the library insert size. I believe the default setting is 3 cycles, and requiring 5 or 10 cycles gives somewhat better CCS reads. However, I typically don't see CCS reads longer than a few kb.
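To see why CCS reads tend to top out at a few kb, here is a rough back-of-the-envelope sketch. It assumes "cycles" means passes of the polymerase around the insert, ignores adapter length and partial first and last passes, and uses a 15 kb polymerase read length purely as an illustrative figure, not as a value from this dataset.

```python
def max_insert_for_passes(polymerase_read_len, min_passes):
    """Longest insert (bp) that could be traversed min_passes times.

    Rough upper bound: adapters and partial passes are ignored.
    """
    if min_passes < 1:
        raise ValueError("min_passes must be at least 1")
    return polymerase_read_len // min_passes


# Illustrative polymerase read length (an assumption, not measured).
polymerase_read_len = 15_000
for passes in (3, 5, 10):
    limit = max_insert_for_passes(polymerase_read_len, passes)
    print(f"{passes} passes: insert up to about {limit} bp")
```

Under these assumptions, subreads around 8 kb would rarely reach even 3 passes, so most ZMWs in a long-insert library would not yield a CCS read at all.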

For genome assemblies, especially if you have 10+ kb segments of repetitive elements, you can probably produce larger assemblies if you start with subreads. Assembly algorithms like Canu or the Celera Assembler have a self-correction step, but this is different from CCS reads, which are each built from the subreads of a single ZMW.

That said, if you have a plausible assembly in hand, and you were able to define a large number of CCS reads, you could test more traditional analysis strategies (such as a BWA alignment with variant and/or structural variant calling) to see if they identify any potential modifications to the assembly. However, you can also use Quiver for polishing, even without CCS reads.