Question

What tools do you use or know of for PacBio long-read error correction?

12

9.3 years ago

Medhat 9.7k

Hi,

Which tools do you use or know of for PacBio long-read error correction, and why? What are their pros and cons?

sequence software sequencing Assembly • 16k views

ADD COMMENT • link updated 3.0 years ago by Ram 43k • written 9.3 years ago by Medhat 9.7k

0

Are there other algorithms for long-read error correction?

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by midox ▴ 290

0

What are chimeric positions?

Thank you.

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by midox ▴ 290

4

PacBio reads can be chimeras, meaning a fusion of sequences that do not occur in that order in the sequenced sample. This can happen either when subreads are not split properly (--subread--adapter--rev-comp-subread--) or during library preparation through random ligation of fragments. Chimeric positions should indicate such breakpoints in a read.
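
To make the subread/adapter/reverse-complement layout concrete, here is a toy sketch in shell. The sequences and the `NNNN` adapter placeholder are made up for illustration, and `revcomp` is a hypothetical helper, not part of any PacBio tool.

```shell
#!/bin/sh
# Toy illustration only: the insert and adapter sequences are made up.

# Reverse-complement a DNA string (uppercase A/C/G/T only).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

subread="AACGGT"
adapter="NNNN"   # placeholder, not a real PacBio adapter sequence

# A read in which the adapter was missed during subread splitting:
# the insert, the adapter, then the reverse complement of the same insert.
chimera="${subread}${adapter}$(revcomp "$subread")"
echo "$chimera"
```

The breakpoint (the chimeric position) in this toy read is where the adapter sits between the two copies of the insert.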

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 9.2 years ago by thackl ★ 3.0k

0

Excuse me, I did not understand the definition of "chimeric reads".

Is there a clear definition?

Thank you.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 9.2 years ago by midox ▴ 290

2

http://drive5.com/usearch/manual/chimera_formation.html

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by thackl ★ 3.0k

0

The flag you mentioned: --subread--adapter--rev-comp-subread--

What tool is that for? PacBio's consensus caller?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Lynxoid ▴ 230

2

What are chimeric reads?

ADD REPLY • link 9.2 years ago by Medhat 9.7k

Ram · Answer 1 · 2015-02-02

11

9.2 years ago

thackl ★ 3.0k

Check out proovread. It ..

.. outperforms PacBioToCA/LSC in terms of accuracy and contiguity/sensitivity
.. is easy to install/run/configure
.. supports various types of data
   - HiSeq/MiSeq (100-500 bp)
   - Unitigs
   - 454, ...

proovread maps high-coverage data to PacBio reads (bwa mem, blasr, daligner) in multiple iterations. This is not the fastest approach (if speed is your concern, go for LoRDEC: http://www.atgc-montpellier.fr/lordec/), but it is the most thorough, giving you the most out of your PacBio data. You can find some comparative stats here.

ADD COMMENT • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by thackl ★ 3.0k

3

I used proovread to correct PacBio cDNA reads. It worked out of the box.

I tried PacBioToCA and LSC, both were way too slow in my settings.

For me it was also important to get the corrected but untrimmed reads to retain full length transcripts. PacBioToCA for instance did not provide this option.

When correcting with Illumina RNA-seq short read data it is also helpful to normalize the data first to further speed up the correction. I used normalize-by-median.py of the khmer package.

This works well with proovread, since proovread uses a coverage cutoff anyway and prioritizes reads that map with fewer mismatches.

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by AndreM ▴ 30

0

Just for information, the hyperlink for LoRDEC in your post does not seem to work (404 not found).

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 8.4 years ago by edrezen ▴ 730

0

it can be visited here: http://www.atgc-montpellier.fr/lordec/

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 8.4 years ago by pengchy ▴ 450

Ram · Answer 2 · 2015-01-26

4

9.2 years ago

Charles Warden 8.2k

I've used PBcR in the Celera Assembler package.

It works for hybrid assembly as well as PacBio-only assembly (in fact, I found that self-correction worked better than hybrid correction on the dataset I worked with).

If you haven't seen them already, I would also recommend viewing the tutorials on the PacBio website.

I think there are at least one or two talks that review methods for de novo assembly and read correction.

ADD COMMENT • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by Charles Warden 8.2k

1

I saw the tutorials, but are there any other tools?

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by Medhat 9.7k

2

Off the top of my head, I don't recall which assembly tools specifically have an error correction step.

Some de novo assembly tools that I recall include HGAP and MIRA. I think the computer associated with the sequencer should have a de novo assembly algorithm, which I think is HGAP. I think MIRA does an error correction, but it only works for hybrid correction with Illumina reads.

Quiver can also be used to polish assemblies (so, correct errors post-assembly rather than pre-assembly): https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst

In fact, the HGAP link appears to recommend using Quiver as part of the assembly pipeline.

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by Charles Warden 8.2k

Ram · Answer 3 · 2015-01-26

4

9.2 years ago

midox ▴ 290

There is "LoRDEC: accurate and efficient long read error correction". It uses a de Bruijn graph (DBG) built from short reads to correct erroneous regions in long reads.

It's a new program for correcting long reads.
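
To illustrate the general idea of using short-read k-mers to locate erroneous regions in a long read, here is a toy sketch in shell. It is not LoRDEC's implementation: the reads are made up, `kmers` is a hypothetical helper, and a plain sorted list of short-read k-mers stands in for the de Bruijn graph. It only flags long-read k-mers that never occur in the short reads and does not attempt the actual correction.

```shell
#!/bin/sh
# Toy sketch, not LoRDEC's implementation: flag k-mers of a long read
# that never occur in the short reads ("weak" k-mers).

# Print every k-mer of a sequence, one per line.
kmers() {
    printf '%s\n' "$1" |
        awk -v k="$2" '{ for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }'
}

K=4
short_reads="ACGTACGGA CGTACGGAT"   # made-up, error-free short reads
long_read="ACGTTCGGAT"              # made-up long read with one substitution

# The set of short-read k-mers stands in for the nodes of the graph.
solid=$(for r in $short_reads; do kmers "$r" "$K"; done | sort -u)

# Classify each long-read k-mer as solid (seen in short reads) or weak.
classification=$(kmers "$long_read" "$K" | while read -r km; do
    if printf '%s\n' "$solid" | grep -Fxq "$km"; then
        echo "$km solid"
    else
        echo "$km weak"
    fi
done)
echo "$classification"
```

In this toy case the single substitution makes every k-mer overlapping it weak, while the k-mers at both ends stay solid; a real corrector would then try to bridge the weak stretch using the short-read graph.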

ADD COMMENT • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by midox ▴ 290

0

Note also that, as a side effect, LoRDEC generates an output HDF5 file holding the de Bruijn graph of the short reads. This HDF5 file can be the input for other tools working on the short reads (de novo assembly with minia, SNP detection with discoSNP, ...); see here.

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by edrezen ▴ 730

Ram · Answer 4 · 2015-01-29

3

9.2 years ago

Sean R Johnson ▴ 120

ECTools is the one that has worked best for me. It's written for a particular kind of grid computing system though, so you may have to modify step 8 from their tutorial to suit your particular environment.

For running on a single server (which will be pretty slow, but this is just an example of how to wrap their scripts for a different scheduling system), I used the following bash script instead of steps 8 and 9:

#!/bin/bash
# Run the ECTools correction on a single server with GNU parallel
# instead of submitting a grid array job.
export TMPDIR=/a/directory/for/temporary_files
mkdir -p "$TMPDIR"
THREADS=12
NUM_PARTITIONS=213 # partition directories are named 0001, 0002, ... (4 digits, zero-padded)
NUM_FILES_PER_PARTITION=500
ORGANISM_NAME=organism_name

# correct.sh is called with SGE_TASK_ID set, as the scheduler would do.
run_file() {
        export SGE_TASK_ID=$1
        ../correct.sh
}
export -f run_file

# seq pads the partition number, so no eval around a brace expansion is needed.
for i in $(seq -f '%04g' 1 "$NUM_PARTITIONS")
do
        echo "$i"
        cd "$i" || exit 1
        parallel -j "$THREADS" run_file ::: $(seq 1 "$NUM_FILES_PER_PARTITION")
        cd ..
done
cat ????/*.cor.fa > "${ORGANISM_NAME}.cor.fa"
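
The loop relies on the partition directories carrying four-digit, zero-padded names. As a minimal sketch, assuming `seq` with `-f` support is available, such names can be generated like this (`partition_ids` is a hypothetical helper name, not part of ECTools):

```shell
#!/bin/sh
# Print N zero-padded, four-digit partition names (0001, 0002, ...).
partition_ids() {
    seq -f '%04g' 1 "$1"
}

partition_ids 3
```

Passing the plain partition count (for example 213) avoids having to keep the zero padding in the variable itself.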

ADD COMMENT • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by Sean R Johnson ▴ 120

0

Does ECTools perform error correction of long reads?

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by midox ▴ 290

0

Well... according to the README file: "In short, the correction algorithm takes as input the unitigs from a short read assembly and uses them to correct long read data."

So, the answer to your question is yes. Although, much like some of the tools mentioned in other answers, it relies on short reads to do so.

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by Sean R Johnson ▴ 120

Ram · Answer 5 · 2015-01-28

1

9.2 years ago

Felix Francis ▴ 600

PacBioToCA (error correction via Celera Assembler) can also be used for error-correcting PacBio reads with short reads, including Illumina reads.

Another one is LSC.

ADD COMMENT • link updated 3.0 years ago by Ram 43k • written 9.2 years ago by Felix Francis ▴ 600

Ram · Answer 6 · 2015-02-02

I am very happy with proovread.

It is extremely flexible with respect to the type of Illumina data (HiSeq, MiSeq, unitigs, etc.), quite fast, completely tunable, and the author (Thomas Hackl) is very responsive. We have used it to correct a lot of PacBio data, and it is extremely stable.

In contrast to ECTools, which takes much much longer and gives cluster jobs with unpredictable runtimes (depending on how many repeats the PacBio reads have), proovread jobs have a predictable runtime with little variation, which makes it easy to tailor jobs to the requirements of a compute cluster (runtimes, # cores etc). Memory usage is minimal.

Ram · Answer 7 · 2015-03-09

I recently used proovread to correct long reads by mapping short reads to them. For my data volume, the correction took one week on an HPC cluster. The author of this software, Thomas, is indeed very responsive. I also used PacBioToCA for PacBio self-correction with 40x coverage, even though 50x is recommended for good performance, and I didn't get satisfying results. With proovread you lose around 25% of the PacBio sequence length; with Celera, in my case, it was around 60%.
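
One way to put a number on such a loss is to compare the total number of bases before and after correction. The sketch below is only an illustration on made-up toy FASTA files; `fasta_bases` and `percent_lost` are hypothetical helper names, and it assumes the raw file is not empty.

```shell
#!/bin/sh
# Sum the sequence lengths in a FASTA file (header lines start with ">").
fasta_bases() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Percentage of bases lost between a raw and a corrected FASTA file.
percent_lost() {
    raw=$(fasta_bases "$1")
    cor=$(fasta_bases "$2")
    awk -v r="$raw" -v c="$cor" 'BEGIN { printf "%.1f\n", 100 * (r - c) / r }'
}

# Made-up toy data so the sketch runs on its own.
tmp=$(mktemp -d)
printf '>read1\nACGTACGT\nACGT\n>read2\nACGTACGT\n' > "$tmp/raw.fa"
printf '>read1\nACGTACGT\n>read2\nACGTACG\n' > "$tmp/cor.fa"

loss=$(percent_lost "$tmp/raw.fa" "$tmp/cor.fa")
echo "bases lost: ${loss}%"
rm -r "$tmp"
```

On real data, the two arguments would be the uncorrected reads and the corrected (trimmed) output of whichever tool is being evaluated.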

Ram · Answer 8 · 2015-05-08

0

9.0 years ago

pengchy ▴ 450

If the reference genome is known, why not correct the PacBio transcriptome data directly using the transcripts annotated from the genome? The task would then reduce to aligning the PacBio reads to the reference transcripts. That seems like easy work?

ADD COMMENT • link updated 3.0 years ago by Ram 43k • written 9.0 years ago by pengchy ▴ 450

0

You may not have a reference, for example, if you are working with new species or with a metagenomic sample.

ADD REPLY • link updated 3.0 years ago by Ram 43k • written 8.4 years ago by Lynxoid ▴ 230