23 months ago by
France / Toulouse / GeT-Plage
Hi Picasa !
Indeed, you will need to perform a polishing step after assembly. This will lower your error rate from ~1% (after assembly with canu) to ~0.001-0.0000001%, depending on the coverage used for polishing. You can find more information in the section "What is the expected quiver accuracy?" of this FAQ: https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/FAQ.rst
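The FAQ linked above talks about accuracy in Phred-scaled quality values (QV) rather than percentages, so it can help to convert between the two. A minimal sketch, assuming the usual definition QV = -10 * log10(p) with p the per-base error probability; the function name error_pct_to_qv is just an illustration:

```shell
# Convert a per-base error rate given in percent to a Phred-scaled QV.
# QV = -10 * log10(p), where p = percent / 100.
error_pct_to_qv() {
    # awk only has a natural log, so divide by log(10) to get log10.
    awk -v pct="$1" 'BEGIN { printf "%.0f\n", -10 * log(pct / 100) / log(10) }'
}

error_pct_to_qv 1        # ~1% error after assembly
error_pct_to_qv 0.001    # upper end of the error range after polishing
```

With this definition, 1% corresponds to Q20 and 0.001% to Q50.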
So you do need an alignment file, but with PacBio everything is a bit more complicated. Because of their bas/bax.h5 format, you will need a dedicated aligner (blasr, as you said) to produce the alignment. However, this alignment will not be in BAM format but in a specific format called .cmp.h5 (cmp stands for "comparison", I guess).
Blasr aligns file by file. To align all your .bas/bax.h5 files, you are going to use another tool called pbalign. This tool can take as input a .fofn file (file of file names), which is a list of the paths to your raw data files. It will literally call blasr for each raw data file and then compile the results.
Here is a small example of how a .fofn file should look (sorry for the horrible paths):
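A .fofn is plain text with one path per line, so it can also be generated rather than typed by hand. A minimal sketch, assuming the raw data are .bax.h5 files somewhere under a single directory; the function name make_fofn and the paths in the example call are hypothetical:

```shell
# Build a .fofn (file of file names) listing every .bax.h5 under a directory.
make_fofn() {
    local raw_dir="$1"   # directory holding the raw PacBio data
    local out="$2"       # .fofn file to write

    # Resolve raw_dir to an absolute path so each listed path is absolute,
    # then write one path per line, sorted for reproducibility.
    find "$(cd "$raw_dir" && pwd)" -type f -name '*.bax.h5' | sort > "$out"
}

# Example call (hypothetical paths):
# make_fofn /path/to/raw_data raw_reads.fofn
```

If your data are listed as .bas.h5 instead, change the pattern passed to -name accordingly.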
As it was really hard for me to gather this information, here is how I used pbalign:
pbalign --forQuiver raw_reads.fofn assembly_to_polish.fasta your_output.cmp.h5
Then you can run quiver with something like this:
quiver your_output.cmp.h5 -r assembly_to_polish.fasta -o polished_assembly.fasta
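The two commands above can be chained in a small script that first checks the .fofn, so a typo in a path is caught before the long alignment step starts. This is only a sketch: the file names are the placeholders used in this post, and the run and polish helpers as well as the DRY_RUN switch are my own conveniences, not pbalign or quiver options.

```shell
# Print a command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Check the .fofn, align with pbalign, then polish with quiver.
polish() {
    local fofn="$1" assembly="$2" cmp="$3" polished="$4"
    local path

    # Every non-empty line of the .fofn must point to an existing file.
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue
        if [ ! -f "$path" ]; then
            echo "missing input listed in $fofn: $path" >&2
            return 1
        fi
    done < "$fofn"

    # Same two commands as above; quiver runs only if pbalign succeeded.
    run pbalign --forQuiver "$fofn" "$assembly" "$cmp" &&
        run quiver "$cmp" -r "$assembly" -o "$polished"
}

# Example call with the names used above:
# polish raw_reads.fofn assembly_to_polish.fasta your_output.cmp.h5 polished_assembly.fasta
```

Setting DRY_RUN=1 before calling polish only prints the two commands, which is handy for checking the arguments without launching anything.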
For installing the PacBio tools, I refer you to this GitHub thread: https://github.com/PacificBiosciences/pbalign/issues/67
Don't try to install all the PacBio tools by yourself, and don't even try pitchfork; you will lose a lot of time, just as I did.
Also, here is a related Biostars post: A: Choosing de novo genome assembly
Good luck with your polishing! This approach worked fine for me after weeks of research on the subject.