Question: What tools do you use or know for PacBio long-read error correction?
Medhat8.6k wrote:

Hi,

What tools do you use or know of for PacBio long-read error correction, and why? What are their pros and cons?

ADD COMMENTlink modified 4.6 years ago by pengchy410 • written 4.9 years ago by Medhat8.6k

Are there other algorithms for long-read error correction?

ADD REPLYlink written 4.9 years ago by midox240

What are chimeric positions?
Thank you.

ADD REPLYlink written 4.8 years ago by midox240

PacBio reads can be chimeras, meaning a fusion of sequences that don't occur in that order in the sequenced sample. This can happen either when subreads are not split properly (--subread--adapter--rev-comp-subread--) or during library preparation through random ligation of fragments. Chimeric positions should indicate such breakpoints in a read.
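To make the first case concrete, here is a toy sketch of the subread--adapter--rev-comp-subread layout. It only uses exact string matching and a made-up placeholder adapter (GGGGGG), and the function names are hypothetical. Real reads are noisy, so actual tools have to rely on alignment instead.

```bash
#!/usr/bin/env bash
# Toy illustration only: exact matching, placeholder adapter sequence.

# Reverse-complement a DNA string.
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTacgt' 'TGCAtgca'
}

# Print the 1-based position where a missed adapter starts if the read has the
# form insert + adapter + revcomp(insert); otherwise print nothing and return 1.
chimeric_position() {
    local read_seq="$1" adapter="$2" left right
    case $read_seq in
        *"$adapter"*) ;;
        *) return 1 ;;
    esac
    left=${read_seq%%"$adapter"*}   # everything before the first adapter hit
    right=${read_seq#*"$adapter"}   # everything after the first adapter hit
    if [ -n "$left" ] && [ "$right" = "$(revcomp "$left")" ]; then
        echo $(( ${#left} + 1 ))
    else
        return 1
    fi
}

# Insert AACGT, placeholder adapter GGGGGG, then the reverse complement ACGTT.
chimeric_position AACGTGGGGGGACGTT GGGGGG   # prints 6
```

The printed position is the breakpoint, i.e. where the read would have to be split to recover the two subreads.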

ADD REPLYlink modified 3 days ago by RamRS25k • written 4.8 years ago by thackl2.7k

Excuse me, I did not understand the definition of "chimeric reads".

Is there a clear definition?

Thank you.

ADD REPLYlink modified 3 days ago by RamRS25k • written 4.8 years ago by midox240

http://drive5.com/usearch/manual/chimera_formation.html

ADD REPLYlink written 4.8 years ago by thackl2.7k

The flag you mentioned: --subread--adapter--rev-comp-subread--

What tool is that for? PacBio's consensus caller?

ADD REPLYlink modified 3 days ago by RamRS25k • written 4.0 years ago by Lynxoid220

What are chimeric reads?

ADD REPLYlink written 4.8 years ago by Medhat8.6k
thackl2.7k wrote:


Check out proovread (https://github.com/BioInf-Wuerzburg/proovread). It ..

  • .. outperforms PacBioToCA/LSC in terms of accuracy and contiguity/sensitivity (http://dx.doi.org/10.1093/bioinformatics/btu392)
  • .. is easy to install/run/configure
  • .. supports various types of data
    • HiSeq/MiSeq (100-500bp)
    • Unitigs
    • 454, ...

proovread maps high-coverage data to PacBio reads (bwa mem, blasr, daligner) in multiple iterations. This is not the fastest approach (if speed is your concern, go for LoRDEC http://www.atgc-montpellier.fr/lordec/) but the most thorough one, giving you the most out of your PacBio data. You can find some comparative stats here: https://github.com/BioInf-Wuerzburg/proovread/blob/master/README.org#crunching-numbers

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by thackl2.7k

I used proovread to correct PacBio cDNA reads. It worked out of the box.

I tried PacBioToCA and LSC, both were way too slow in my settings.

For me it was also important to get the corrected but untrimmed reads to retain full length transcripts. PacBioToCA for instance did not provide this option.

When correcting with Illumina RNA-seq short read data it is also helpful to normalize the data first to further speed up the correction. I used normalize-by-median.py of the khmer package https://github.com/ged-lab/khmer/blob/master/scripts/normalize-by-median.py

This works well with proovread, since proovread uses a coverage cutoff anyway and prioritizes reads mapping with fewer mismatches.
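For anyone wondering what median-based normalization does, here is a rough sketch of the idea, not khmer's implementation. A read is kept only if the median count of its k-mers, among the reads kept so far, is still below a cutoff. The function name `diginorm` is hypothetical, the input is simplified to one sequence per line, and exact counts in an awk array stand in for khmer's memory-efficient counting structure.

```bash
#!/usr/bin/env bash
# Toy digital normalization: usage  diginorm K CUTOFF < reads.txt
# Input: one read sequence per line (a simplification; real data is FASTA/FASTQ).
diginorm() {
    awk -v k="$1" -v cutoff="$2" '
    {
        n = length($0) - k + 1            # number of k-mers in this read
        if (n < 1) next                   # read shorter than k: skip it
        low = 0                           # k-mers seen fewer than cutoff times
        for (i = 1; i <= n; i++)
            if (count[substr($0, i, k)] + 0 < cutoff) low++
        # The (lower) median count is below the cutoff exactly when at least
        # half of the k-mers, rounded up, have a count below the cutoff.
        if (low >= int((n + 1) / 2)) {
            print                          # keep the read ...
            for (i = 1; i <= n; i++) count[substr($0, i, k)]++   # ... and count its k-mers
        }
    }'
}

# Three copies of one read plus one unrelated read, k = 4, cutoff = 2:
# the third copy is dropped because its k-mers have already been seen twice.
printf 'ACGTTGCA\nACGTTGCA\nACGTTGCA\nGGGGCCCCAA\n' | diginorm 4 2
```

Highly expressed transcripts get thinned out while rare ones are kept, which is why there is less short-read data to map afterwards.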

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by AndreM30

Just for information, the hyperlink for LoRDEC in your post (http://www.atgc-montpellier.fr/lordec/) does not seem to work (404 not found).

ADD REPLYlink written 4.0 years ago by edrezen720

It can be visited here: http://www.atgc-montpellier.fr/lordec/

ADD REPLYlink modified 2 days ago by RamRS25k • written 4.0 years ago by pengchy410
Charles Warden7.5k wrote:

I've used PBcR in the Celera Assembler package: http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR

It works for hybrid assembly as well as PacBio-only assembly (actually, I found self-correction worked better than hybrid correction with the dataset that I worked with).

If you haven't seen them already, I would also recommend viewing the tutorials on the PacBio website: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Bioinformatics-Workshop

I think there are at least one or two talks that review methods for de novo assembly and read correction.

ADD COMMENTlink modified 7 weeks ago by RamRS25k • written 4.9 years ago by Charles Warden7.5k

I saw the tutorials, but are there any other tools?

ADD REPLYlink written 4.9 years ago by Medhat8.6k

Off the top of my head, I don't recall which assembly tools specifically have an error correction step.

Some de novo assembly tools that I recall include HGAP (https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP) and MIRA (http://sourceforge.net/p/mira-assembler/wiki/Home/). I think the computer associated with the sequencer should have a de novo assembly algorithm, which I think is HGAP. I think MIRA does error correction, but it only works for hybrid correction with Illumina reads.

Quiver can also be used to polish assemblies (so, correct errors post-assembly rather than pre-assembly):

https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst

In fact, the HGAP link appears to recommend using Quiver as part of the assembly pipeline.

ADD REPLYlink modified 7 weeks ago by RamRS25k • written 4.9 years ago by Charles Warden7.5k
midox240 wrote:

There is "LoRDEC: accurate and efficient long read error correction". It uses a de Bruijn graph (DBG) built from short reads to correct erroneous parts of long reads.

It's a new program for correcting long reads.
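As a rough illustration of the first step a de Bruijn graph approach like this needs, the sketch below marks which k-mers of a long read also occur in the short reads ("S" for solid, "w" for weak). This is a toy with a hypothetical function name, not LoRDEC's code. As far as I understand, the real tool also filters k-mers by abundance and then searches paths in the graph to rewrite the weak stretches.

```bash
#!/usr/bin/env bash
# Toy solid/weak k-mer mask: usage  solid_mask K SHORT_READS_FILE LONG_READ
# SHORT_READS_FILE holds one short-read sequence per line (a simplification)
# and must not be empty.
solid_mask() {
    printf '%s\n' "$3" | awk -v k="$1" '
        NR == FNR {                        # first input: the short reads
            for (i = 1; i + k - 1 <= length($0); i++)
                seen[substr($0, i, k)] = 1
            next
        }
        {                                  # second input (stdin): the long read
            mask = ""
            for (i = 1; i + k - 1 <= length($0); i++) {
                kmer = substr($0, i, k)
                mask = mask ((kmer in seen) ? "S" : "w")
            }
            print mask
        }' "$2" -
}
```

For example, with short reads ACGTACGG and ACGGATCA, k = 4, and the long read ACGTATGGATCA (one substitution at position 6), the mask comes out as SSwwwwSSS: the run of w marks the region to correct, flanked by solid k-mers on both sides.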

ADD COMMENTlink written 4.9 years ago by midox240

Note also that, as a side effect, LoRDEC generates an output HDF5 file holding the de Bruijn graph of the short reads. This HDF5 file can be the input for other tools on the short reads (de novo assembly with minia, SNP detection with discoSNP, ... see here)

ADD REPLYlink written 4.9 years ago by edrezen720
Sean R Johnson120 wrote:

ECTools (https://github.com/jgurtowski/ectools) is the one that has worked best for me. It's written for a particular kind of grid computing system though, so you may have to modify step 8 from their tutorial to suit your particular environment.

For running on a single server (which will be pretty slow, but this is just an example of how to wrap their scripts for a different scheduling system), I used the following bash script instead of steps 8 and 9:

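#!/bin/bash
# Run the ECTools correction jobs on a single server with GNU parallel
# instead of submitting them to a grid scheduler.
export TMPDIR=/a/directory/for/temporary_files
mkdir -p "$TMPDIR"
THREADS=12
NUM_PARTITIONS=0213 # should be a 4 character wide integer left-padded with zeros
NUM_FILES_PER_PARTITION=500
ORGANISM_NAME=organism_name

# Set SGE_TASK_ID by hand, as the grid scheduler would for an array job.
run_file() {
        export SGE_TASK_ID=$1
        ../correct.sh
}
export -f run_file # make the function visible to the shells started by parallel

# Braces are expanded before variables, hence the eval.
for i in $(eval echo "{0001..$NUM_PARTITIONS}")
do
        echo "$i"
        cd "$i" || exit 1 # stop rather than run jobs in the wrong directory
        parallel -j "$THREADS" run_file ::: $(eval echo "{1..$NUM_FILES_PER_PARTITION}")
        cd ..
done
cat ????/*.cor.fa > "${ORGANISM_NAME}.cor.fa"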

 

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Sean R Johnson120

Does ECTools perform error correction of long reads?

ADD REPLYlink written 4.9 years ago by midox240

Well... according to the README file: "In short, the correction algorithm takes as input the unitigs from a short read assembly and uses them to correct long read data."

So, the answer to your question is yes. Although, much like some of the tools mentioned in other answers, it relies on short reads to do so.

ADD REPLYlink written 4.8 years ago by Sean R Johnson120
Felix Francis490 wrote:

PacBioToCA (error correction via Celera Assembler) can also be used for error correcting PacBio reads using short reads, including Illumina reads.

Another one is LSC.

ADD COMMENTlink modified 7 weeks ago by RamRS25k • written 4.9 years ago by Felix Francis490
micha.hiller10 wrote:

I am very happy with proovread.

It is extremely flexible with respect to the type of Illumina data (HiSeq, MiSeq, unitigs etc.), quite fast, completely tunable, and the author (Thomas Hackl) is very responsive. We have used it to correct lots of PacBio reads and it is extremely stable.

In contrast to ECTools, which takes much much longer and gives cluster jobs with unpredictable runtimes (depending on how many repeats the PacBio reads have), proovread jobs have a predictable runtime with little variation, which makes it easy to tailor jobs to the requirements of a compute cluster (runtimes, # cores etc). Memory usage is minimal. 

ADD COMMENTlink written 4.8 years ago by micha.hiller10
Pawel Osipowski20 wrote:

I used proovread recently to correct long reads by mapping short reads. For my data volume I had to use an HPC cluster for one week to finish the correction. The author of this software, Thomas, is indeed very responsive. I also used pacbioToCA for PacBio self-correction with 40x coverage, even though 50x is the stated requirement for good performance. I didn't get satisfying results. With proovread you lose around 25% of the PacBio read length. With Celera, in my case, it was around 60%.

ADD COMMENTlink written 4.8 years ago by Pawel Osipowski20
pengchy410 wrote:

If we know the reference genome, why not correct the PacBio transcriptome data directly using the transcripts annotated from the genome? The work would then reduce to aligning the PacBio reads to the reference transcripts. That seems like easy work?

ADD COMMENTlink written 4.6 years ago by pengchy410

You may not have a reference, for example, if you are working with new species or with a metagenomic sample.

ADD REPLYlink written 4.0 years ago by Lynxoid220