Question: Is there any reason not to run RS_ReadsOfInsert?
conrad.stack wrote, 3.7 years ago:

I've included some background info after the questions, which come first in case of TL;DR:


  1. Is it generally a best-practice to run CircularConsensus on SMRT cell DNAseq data before doing an analysis such as scaffolding or genome assembly?

  2. Under what circumstances would you not run CircularConsensus?

(I posted these same questions on seqanswers)


Our collaborators sent us 9.4 Gbp (12 SMRT cells) of plant DNA sequencing from an RS II (P6-C4 chemistry, I think). We estimate this represents ~20x coverage of the plant's genome. All of the initial processing was done by the collaborators; their last step was filtering of the subreads, which have a post-filter N50 of around 8,000 bp.
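For context, the coverage estimate implies a genome size you can back out directly (a minimal sketch using the numbers quoted above; the figures are the thread's, the arithmetic is illustrative):

```python
# Back-of-envelope check of the coverage estimate (numbers from the post).
total_bases = 9.4e9   # 9.4 Gbp across 12 SMRT cells
coverage = 20         # estimated fold coverage

genome_size = total_bases / coverage
print(f"Implied genome size: {genome_size / 1e6:.0f} Mbp")  # → 470 Mbp
```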

My Goal

I am trying to use the reads to gap-fill and do additional scaffolding of a draft genome assembly.


I ran PBJelly using the uncorrected subreads, providing the Analysis_results directories for each SMRT cell. The results seem very good: about half of the gaps were filled, and the scaffold N50 increased by 20%.

But I suspect that some of the filled gaps, especially those in repetitive regions, are not correct. When I looked at the subread placement over each gap (produced by PBJelly), I noticed that some gaps were filled, for example, by a minority (N=2) of the total subreads (N=9) from a single ZMW. There were a few instances like this. It occurred to me that it may have been a mistake to use the subreads rather than consensus sequences.
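The minority-support check described above can be scripted from read names alone, since PacBio subread names follow the convention movieName/zmwHole/qStart_qEnd. This is a hypothetical helper (the function name, example movie name, and counts are made up; only the naming convention is standard):

```python
from collections import defaultdict

def zmw_support(supporting_reads, all_reads):
    """Return {zmw: (n_supporting, n_total)} per ZMW, where a ZMW is
    identified by the movieName/zmwHole prefix of a subread name."""
    def zmw_of(name):
        movie, hole, _coords = name.split("/")
        return f"{movie}/{hole}"

    totals = defaultdict(int)
    support = defaultdict(int)
    for name in all_reads:
        totals[zmw_of(name)] += 1
    for name in supporting_reads:
        support[zmw_of(name)] += 1
    return {z: (support[z], totals[z]) for z in totals}

# Example mirroring the case in the post: 2 of 9 subreads from one ZMW
# support a filled gap (hypothetical movie name and coordinates).
all_reads = [f"m150101_000000_42/1234/{i*1000}_{i*1000 + 900}" for i in range(9)]
supporting = all_reads[:2]
print(zmw_support(supporting, all_reads))
```

A low supporting/total ratio for a ZMW spanning a filled gap would flag that gap for manual review.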

pacbio dna-seq rsii ccs genome
modified 3.7 years ago by Charles Warden • written 3.7 years ago by conrad.stack

Cross-posted here.

written 3.7 years ago by Brian Bushnell
Charles Warden (Duarte, CA) wrote, 3.7 years ago:

It depends on the library (insert) size. I believe the default setting is a minimum of 3 passes, and requiring 5 or 10 passes is a little better when defining CCS reads. However, I typically don't see CCS reads longer than a few kb.
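The reason long CCS reads are rare follows from simple arithmetic: the number of full passes is roughly the polymerase read length divided by the insert length, so long inserts can't accumulate enough passes for a consensus. A minimal sketch with illustrative numbers (the 15 kb polymerase read length is an assumption, not a figure from the thread):

```python
def expected_passes(polymerase_read_len, insert_len):
    """Rough number of complete passes around a circular template."""
    return polymerase_read_len // insert_len

# Assuming a ~15 kb polymerase read:
print(expected_passes(15_000, 1_000))  # short insert: many passes, good CCS
print(expected_passes(15_000, 8_000))  # long insert: too few passes for CCS
```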

For genome assemblies, especially if you have 10+ kb stretches of repetitive elements, you can probably produce larger assemblies by starting with subreads (assembly algorithms like Canu and the Celera Assembler have a self-correction step, but this is different from CCS, where all the passes used for the consensus come from the same ZMW).

That said, if you have a plausible assembly in hand and you were able to define a large number of CCS reads, you could test more traditional analysis strategies (such as a BWA alignment with variant and/or structural-variant calling) to see if they identify any potential modifications to the assembly. However, you can also use Quiver for polishing, even without CCS reads.


Generating reads of insert (with no minimum number of passes) should be the most efficient and accurate way to correct reads, without the risk of correcting to an inexact repeat or the need for an expensive all-to-all alignment. I see major advantages (it increases the quality of the data while reducing the volume of redundant data, thus diminishing the computational cost of working with PacBio data) with no downsides, at least theoretically.

written 3.7 years ago by Brian Bushnell
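Brian's volume argument can be sketched numerically: one consensus read per ZMW collapses its redundant subreads into a single, higher-accuracy sequence. The counts below are illustrative, not from the thread:

```python
# Hypothetical subread counts per ZMW (e.g. short inserts yield many passes).
subreads_per_zmw = {"zmw_1": 9, "zmw_2": 4, "zmw_3": 1}

raw_reads = sum(subreads_per_zmw.values())  # subreads fed to downstream tools
consensus_reads = len(subreads_per_zmw)     # one read of insert per ZMW
print(raw_reads, consensus_reads)           # → 14 3
```

Even with no minimum-pass filter, every ZMW still yields a read, so no data is discarded; the redundancy is simply folded into per-base quality.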

Powered by Biostar version 2.3.0