Question: Is there any reason not to run RS_ReadsOfInsert?
2.2 years ago, conrad.stack wrote:

I've included some background info after the questions, which come first in case of TL;DR.

Questions:

  1. Is it generally a best practice to run CircularConsensus on SMRT cell DNA-seq data before an analysis such as scaffolding or genome assembly?

  2. Under what circumstances would you not run CircularConsensus?

(I posted these same questions on seqanswers)


Background

Our collaborators sent us 9.4 Gbp (12 SMRT cells) of plant DNA sequencing data from an RSII (P6C4, I think). We estimate this represents ~20x coverage of the plant's genome. All of the initial processing was done by the collaborators. Their last step was subread filtering; the filtered subreads have an N50 of around 8,000 bp.
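For readers who want to sanity-check numbers like these, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the genome size is only what the stated yield and coverage imply, not a measured value.

```python
def n50(lengths):
    """Return the N50: the largest length L such that reads of
    length >= L together hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def implied_genome_size(total_bases, coverage):
    """Genome size implied by a total yield and an estimated fold coverage."""
    return total_bases / coverage


if __name__ == "__main__":
    # 9.4 Gbp at ~20x coverage implies a genome of roughly 470 Mbp.
    print(implied_genome_size(9.4e9, 20))
    # Toy read lengths, not real data.
    print(n50([2, 3, 4, 5, 6]))
```

Running `n50` over the subread lengths before and after filtering is a quick way to confirm the post-filter figure quoted by the collaborators.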

My Goal

I am trying to use the reads to gap-fill and do additional scaffolding of a draft genome assembly.

Results

I ran PBjelly on the uncorrected subreads, providing the separate Analysis_results directory for each SMRT cell. The results seem very good: about half of the gaps were filled and the scaffold N50 increased by 20%.

But I suspect that some of the filled gaps, especially those in repetitive areas, are not correct. When I looked at the subread placement over each gap (produced by PBjelly), I noticed that some gaps were filled by only a minor proportion (N=2) of the total subreads (N=9) from a ZMW. There were a few instances like this. It occurred to me that maybe it was a mistake to use the subreads rather than consensus sequences.
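A check like the one described above can be sketched roughly as follows. This assumes RS II style subread names of the form `movie/holeNumber/start_end`; the function names and the 0.5 threshold are hypothetical, and extracting the supporting read names from the PBjelly output is not shown.

```python
from collections import Counter


def zmw_of(subread_name):
    """Return 'movie/holeNumber' for a subread named 'movie/holeNumber/start_end'."""
    movie, hole = subread_name.split("/")[:2]
    return f"{movie}/{hole}"


def minority_supported_zmws(all_subreads, supporting_subreads, min_fraction=0.5):
    """List ZMWs where fewer than min_fraction of their subreads support a gap fill.

    all_subreads: names of every subread in the data set.
    supporting_subreads: names of the subreads placed over one filled gap.
    Returns (zmw, n_supporting, n_total) tuples, sorted by ZMW name.
    """
    totals = Counter(zmw_of(name) for name in all_subreads)
    hits = Counter(zmw_of(name) for name in supporting_subreads)
    flagged = []
    for zmw, n_hits in sorted(hits.items()):
        n_total = totals.get(zmw, 0)
        if n_total and n_hits / n_total < min_fraction:
            flagged.append((zmw, n_hits, n_total))
    return flagged


if __name__ == "__main__":
    # Toy names: nine subreads from one ZMW, two of which span the gap.
    reads = [f"m1/42/{i * 100}_{i * 100 + 90}" for i in range(9)]
    print(minority_supported_zmws(reads, reads[:2]))
```

A gap that is supported only by flagged ZMWs would be a candidate for closer inspection, since most passes over the same molecule did not agree with the fill.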

Tags: pacbio dna-seq rsii ccs genome

Cross-posted here.

written 2.2 years ago by Brian Bushnell
2.2 years ago, Charles Warden (Duarte, CA) wrote:

It depends on the library size. I believe the default setting is 3 cycles, and using 5 or 10 cycles is a little better when defining CCS reads. However, I typically don't see CCS reads longer than a few kb.

For genome assemblies, especially if you have 10+ kb segments of repetitive elements, you can probably produce larger assemblies if you start with subreads. Assembly algorithms like Canu or the Celera Assembler have a self-correction step, but this is different from CCS, where all the passes that go into a read come from the same ZMW.

That said, if you have a plausible assembly in hand and you were able to define a large number of CCS reads, you could test more traditional analysis strategies (such as a BWA alignment with variant and/or structural variant calling) to see if they identify any potential modifications to the assembly. However, you can also use Quiver for polishing, even without CCS reads.


Generating reads of insert (with no minimum number of passes) should be the most efficient and accurate way to correct reads, without the risk of correcting to an inexact repeat or the need for an expensive all-to-all alignment. I see major advantages (it increases the quality of the data while reducing the volume of redundant data, which lowers the computational cost of working with PacBio data) and no downsides, at least theoretically.

written 2.2 years ago by Brian Bushnell