What tools do you use or know of for PacBio long read error correction?
8
12
7.0 years ago
Medhat 9.0k

Hi,

What tools do you use or know of for PacBio long read error correction, and why? What are their pros and cons?

sequence software sequencing Assembly • 13k views
0

Are there other algorithms for long read error correction?

0

What are chimeric positions?

Thank you.

4

PacBio reads can be chimeras, meaning a fusion of sequences that don't occur in that order in the sequenced sample. This can happen either when subreads are not split properly (--subread--adapter--rev-comp-subread--) or during library preparation through random ligation of fragments. Chimeric positions should indicate such breakpoints in a read.
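
To make the first case concrete, here is a minimal bash sketch of what a missed adapter looks like: the bases right after the adapter mirror, as a reverse complement, the bases right before it. The function names and the adapter string used in the usage line are made up for illustration, and the check uses exact matching, which real, error-prone PacBio reads would not satisfy; an actual detector would need alignment instead.

```bash
#!/usr/bin/env bash

# Reverse-complement a DNA string (pure bash reversal, then tr for the complement).
revcomp() {
    local seq=$1 out="" i
    for (( i = ${#seq} - 1; i >= 0; i-- )); do
        out+=${seq:i:1}
    done
    printf '%s\n' "$out" | tr 'ACGTacgt' 'TGCAtgca'
}

# Succeeds (exit status 0) if the read looks like
# subread + adapter + reverse-complemented subread.
# Toy assumption: exact matches only, first adapter occurrence only.
looks_like_missed_adapter() {
    local read=$1 adapter=$2
    [[ $read == *"$adapter"* ]] || return 1
    local left=${read%%"$adapter"*}   # bases before the first adapter hit
    local right=${read#*"$adapter"}   # bases after the first adapter hit
    local n=$(( ${#left} < ${#right} ? ${#left} : ${#right} ))
    (( n > 0 )) || return 1
    # Compare the n bases flanking the adapter on each side.
    local left_tail=${left: -n}
    local right_head=${right:0:n}
    [[ $(revcomp "$left_tail") == "$right_head" ]]
}
```

For example, `looks_like_missed_adapter "$read" CCCCCCCC && echo "possible chimera"` would flag a read built as subread, placeholder adapter, reverse-complemented subread.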

0

Excuse me, I did not understand the definition of "chimeric reads".

Is there a clear definition?

Thank you.

0

The flag you mentioned: --subread--adapter--rev-comp-subread--

What tool is that for? PacBio's consensus caller?

11
7.0 years ago
thackl ★ 2.9k

proovread maps high-coverage data to PacBio reads (bwa mem, blasr, daligner) in multiple iterations. This is not the fastest approach (if speed is your concern, go for LoRDEC: http://www.atgc-montpellier.fr/lordec/), but it is the most thorough, getting the most out of your PacBio data. You can find some comparative stats here.

3

I used proovread to correct PacBio cDNA reads. It worked out of the box.

I tried PacBioToCA and LSC, both were way too slow in my settings.

For me it was also important to get the corrected but untrimmed reads in order to retain full-length transcripts. PacBioToCA, for instance, did not provide this option.

When correcting with Illumina RNA-seq short read data, it also helps to normalize the data first to further speed up the correction. I used normalize-by-median.py from the khmer package.

This works well with proovread, since proovread uses a coverage cutoff anyway and prioritizes reads that map with fewer mismatches.

0

It can be visited here: http://www.atgc-montpellier.fr/lordec/

4
7.0 years ago

I've used PBcR in the Celera Assembler package.

It works for hybrid assembly as well as PacBio-only assembly (actually, I found self-correction worked better than hybrid correction with the dataset that I worked with).

If you haven't seen them already, I would also recommend viewing the tutorials on the PacBio website.

I think there are at least one or two talks that review methods for de novo assembly and read correction.

1

I saw the tutorial, but are there any other tools?

2

Off the top of my head, I don't recall which assembly tools specifically have an error correction step.

Some de novo assembly tools that I recall include HGAP and MIRA. I think the computer associated with the sequencer should have a de novo assembly algorithm, which I think is HGAP. I think MIRA does an error correction, but it only works for hybrid correction with Illumina reads.

Quiver can also be used to polish assemblies (so, correct errors post-assembly rather than pre-assembly): https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst

In fact, the HGAP link appears to recommend using Quiver as part of the assembly pipeline.

4
7.0 years ago
midox ▴ 270

There is "LoRDEC: accurate and efficient long read error correction". It uses a de Bruijn graph (DBG) built from short reads to correct erroneous regions in long reads.

It's a new program for correcting long reads.
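
As a rough illustration of the idea only (not LoRDEC's actual algorithm or interface), the sketch below collects all k-mers from a set of short reads and reports where a long read contains k-mers that the short reads never show. Runs of such "weak" k-mers mark the regions a DBG-based corrector would try to bridge with a path through the graph. The function name and the one-read-per-line input format are assumptions made for this example, and it skips the k-mer abundance threshold a real tool would apply.

```bash
#!/usr/bin/env bash

# Print the 1-based start positions of k-mers in a long read that do not
# occur in any short read. Arguments: k, file of short reads (one sequence
# per line, a toy format), long read sequence.
weak_kmer_starts() {
    local k=$1 short_reads=$2 long_read=$3
    awk -v k="$k" -v lr="$long_read" '
        # Main rule: record every k-mer seen in the short reads.
        { for (i = 1; i + k - 1 <= length($0); i++) seen[substr($0, i, k)] = 1 }
        # END rule: walk the long read and report k-mers missing from that set.
        END {
            sep = ""
            for (i = 1; i + k - 1 <= length(lr); i++) {
                if (!(substr(lr, i, k) in seen)) {
                    printf "%s%d", sep, i
                    sep = " "
                }
            }
            print ""
        }' "$short_reads"
}
```

A single substitution in the long read makes every k-mer overlapping that base weak, so the error shows up as a block of up to k consecutive positions.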

0

Note also that, as a side effect, LoRDEC generates an output HDF5 file holding the de Bruijn graph of the short reads. This HDF5 file can be the input for other tools that work on the short reads (de novo assembly with minia, SNP detection with discoSNP, ...); see here.

3
7.0 years ago

ECTools is the one that has worked best for me. It's written for a particular kind of grid computing system though, so you may have to modify step 8 from their tutorial to suit your particular environment.

For running on a single server (which will be pretty slow, but this is just an example of how to wrap their scripts for a different scheduling system), I used the following bash script instead of steps 8 and 9:

#! /bin/bash
export TMPDIR=/a/directory/for/temporary_files
mkdir -p "$TMPDIR"
THREADS=12
NUM_PARTITIONS=0213 # should be a 4 character wide integer left-padded with zeros
NUM_FILES_PER_PARTITION=500
ORGANISM_NAME=organism_name

# Emulate one SGE array task: export the task ID, then run correct.sh
run_file() {
    export SGE_TASK_ID=$1
    ../correct.sh
}
export -f run_file

for i in $(eval echo {0001..$NUM_PARTITIONS}) # eval needed: braces are evaluated before variables
do
    echo $i
    cd $i
    parallel -j $THREADS run_file ::: $(eval echo {1..$NUM_FILES_PER_PARTITION})
    cd ..
done
cat ????/*.cor.fa > ${ORGANISM_NAME}.cor.fa
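
The script leans on eval to expand a brace range whose upper bound is a variable. As a possible alternative, here is a small sketch that generates the same zero-padded partition names with a plain loop. The function name partition_dirs is made up for this example. The `10#` prefix forces base 10, because bash arithmetic would otherwise treat a value with a leading zero, such as 0213, as octal.

```bash
#!/usr/bin/env bash

# Print zero-padded, 4-character partition names from 0001 up to the
# given (possibly zero-padded) last partition number, one per line.
partition_dirs() {
    local last=$(( 10#$1 ))   # force base 10 so "0213" is not read as octal
    local i dir
    for (( i = 1; i <= last; i++ )); do
        printf -v dir '%04d' "$i"
        printf '%s\n' "$dir"
    done
}
```

The loop header could then read `for i in $(partition_dirs "$NUM_PARTITIONS")`, with the rest of the loop body unchanged.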

0

Does ECTools perform error correction of long reads?

0

Well... according to the README file: "In short, the correction algorithm takes as input the unitigs from a short read assembly and uses them to correct long read data."

So, the answer to your question is yes. Although, much like some of the tools mentioned in other answers, it relies on short reads to do so.

1
7.0 years ago
Felix Francis ▴ 570

PacBioToCA (error correction via Celera Assembler) can also be used to error-correct PacBio reads with short reads, including Illumina's.

Another one is LSC.

1
7.0 years ago
micha.hiller ▴ 10

I am very happy with proovread.

It is extremely flexible with respect to the type of Illumina data (HiSeq, MiSeq, unitigs, etc.), quite fast, completely tunable, and the author (Thomas Hackl) is very responsive. We have used it to correct lots of PacBio reads and it is extremely stable.

In contrast to ECTools, which takes much much longer and gives cluster jobs with unpredictable runtimes (depending on how many repeats the PacBio reads have), proovread jobs have a predictable runtime with little variation, which makes it easy to tailor jobs to the requirements of a compute cluster (runtimes, # cores etc). Memory usage is minimal.

1
6.9 years ago

I used proovread recently to correct long reads by mapping short reads. For my data volume, I had to use an HPC cluster for one week to finish the correction. The author of this software, Thomas, is indeed very responsive. I also used pacbioToCA for PacBio self-correction with 40x coverage, even though the requirement for good performance is 50x, and I didn't get satisfying results. With proovread you lose around 25% of the PacBio length. With Celera, in my case, it was around 60%.
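
One way to put a number on that kind of loss, assuming it refers to total bases before and after correction, is sketched below. The function names are made up for this example, and both inputs are assumed to be plain FASTA files.

```bash
#!/usr/bin/env bash

# Sum the sequence length over all records of a FASTA file
# (header lines starting with ">" are skipped, whitespace is ignored).
fasta_bases() {
    awk '!/^>/ { gsub(/[[:space:]]/, ""); total += length($0) }
         END   { print total + 0 }' "$1"
}

# Percentage of bases lost between the raw and the corrected FASTA file.
percent_lost() {
    local raw corrected
    raw=$(fasta_bases "$1")
    corrected=$(fasta_bases "$2")
    awk -v r="$raw" -v c="$corrected" 'BEGIN {
        if (r == 0) { print "NA"; exit }   # avoid division by zero on empty input
        printf "%.1f\n", 100 * (r - c) / r
    }'
}
```

Calling `percent_lost raw.fa corrected.fa` (file names are placeholders) prints the loss as a percentage with one decimal place.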

0
6.7 years ago
pengchy ▴ 450

If we know the reference genome, why not correct the PacBio transcriptome data directly using the transcripts annotated from the genome? The work would then only be to align the PacBio reads to the reference transcripts. That seems like easy work?

0

You may not have a reference, for example, if you are working with new species or with a metagenomic sample.