Tutorial: Polish PacBio assembly with latest PacBio tools : an affordable solution for everyone
gravatar for Roxane Boyer
4 weeks ago by
Roxane Boyer380
Roxane Boyer380 wrote:

This tutorial addresses to anyone in possession of PacBio sequencing data. After struggling myself to find and test appropriated tools, learning PacBio specific format, asking for precisions to kind PacBio guys, I've managed to gather a lot of useful informations that everyone may need in order to process their data.

I will mainly focuses one of the most important step while doing PacBio genome assembly : the polishing step. Indeed, higher indel rate within PacBio raw reads make it necessary to correct the sequence obtained after assembly. This step consist in two things :

  • aligning your raw PacBio reads against the reference (your freshly baked PacBio assembly)
  • correct the reference base by base to obtained a polished assembly, which will become your final assembly.

PacBio changed their tools lately, getting ride of some strange format (cmp.h5). But the latest tools are not directly available on pacBio's github, and the one proposed in github are obsolet or really hard to install (I'm refeering to pitchfork). Using advices from Mr Stucki working at PacBio, I managed to install easily and quickly all PacBio tools required for polishing. Today, I'm going to describe all the steps I've followed so anyone could access as well at PacBio tools.

Installation of PacBio tools

  1. Download SMRT link : wget https://downloads.pacbcloud.com/public/software/installers/smrtlink_5.0.1.9585.zip ; unzip smrtlink_5.0.1.9585.zip -d destination_folder

Just to warn you, it seems that SMRT Link is not supported on Windows. The steps I'm describing were tested on Ubuntu 14.04 Trusty Tahr.

SMRT Link is a big tool that include all services PacBio can offer. If you have access to a cluster facility and if you match the requirements, you should probably follow classic installation instructions. However, if you just need one or 2 of PacBio tools, you probably don't want to install the whole suit, which is very heavy to install. Here what you should do :

  1. Add a new non-root user "smrtanalysis" : sudo adduser smrtanalysis
  2. Define in your bash what SMRT_USER is : SMRT_USER=smrtanalysis
  3. Log on as the freshly created user : su $SMRT_USER -> Extra tip : type bash to grab your bash environment and have access to auto-completion while typing in terminal
  4. Define SMRT_ROOT : SMRT_ROOT=opt/pacbio/smrtlink -> Extra tip : The $SMRT_ROOT directory must not exist when the installer is invoked, or installation will abort
  5. Launch SMRT analysis with option "tools-only" : ./smrtlink_5.0.1.9585.run --rootdir $SMRT_ROOT --smrttools-only
  6. After installation completed, move into directory that contains PacBio binaries : cd /opt/pacbio/smrtlink/smrtcmds/bin -> Extra tip : add this directory in your PATH bash variable, so you will be able to launch PacBio tool from anywhere
  7. Congratulations ! You can now use every PacBio tools ! Here what you are supposed to have within this directory :

arrow DBdust ice_polish pbsmrtpipe bam2bam DBrm ice_quiver pbsv bam2fasta
DBshow ipdSummary pbsvutil bam2fastq DBsplit ipython pbtranscript bamtools DBstats
ipython2 pbvalidate bamtools-2.4.1 dexta
juliet picking_up_ice bax2bam fasta2DB
julietflow python blasr fasta-to-gmap-reference LA4Falcon python2 ccs fasta-to-reference LA4Ice
python2.7 cleric fuse laa quiver daligner gmap LAmerge REPmask daligner_p gmap_build LAsort samtools datander gmapl motifMaker sawriter dataset HPC.daligner ngmlr separate_flnc dazcon
HPC.REPmask pbalign summarizeModifications DB2Falcon HPC.TANmask pbdagcon TANmask DB2fasta
ice_combine_cluster_bins pbindex undexta DBdump
ice_partial pbservice variantCaller

Assembly polishing with arrow

Now that you successfully installed PacBio tools (I hope !), you can start to do use them. As i said, I'm going to focus on the polishing step. The two main tools for polishing are arrow and quiver, two different algorithms included in the variantCaller tool (the polishing tool). If you already wonder about the differences about the two tools, here the differences between the two of them :

Quiver is supported for PacBio RS data. Arrow is supported for PacBio Sequel data and RS data with the P6-C4 chemistry.

As I understand it on PacBio github (https://github.com/PacificBiosciences/GenomicConsensus), the latest and thus more efficient tool is arrow. As I was myself very disappointed by quiver result (my case is particular, high polymorphism degree), I'm only going to describe polishing steps with the latest arrow algorithm.

Before to describe steps, I'm going to explain a few of my vocabulary about PacBio data. PacBio result, at least for my project, but I heard it was all the same everywhere, are organized by runs. Each run can contain several folder corresponding to one SMRT_cell. In an SMRT_cell directory, you normally have 3 .bax.h5 files, and 1 .bas.h5 files. What I call an SMRT_cell are the 3 bax.h5 files.

Also, something really important I've made, you may probably want to know the polishing coverage you are going to use. If you have a small data set with reasonable coverage (40-50X), then you should probably use all your raw reads for polishing. However, if you dataset is bigger, then you should probably take a subset of your raw reads. I've heard that polishing at very high coverage (>100X), take longer without improving considerably the quality of your assembly. 80X polishing coverage is what I used for correcting my assembly. You can roughly estimate your polishing coverage by converting one SMRT_cell (3 bax files) into fasta format by using the tool pls2fasta (is present in blasr package). Then you can count the total numbers of bases for that SMRT_cell and calculate your coverage with the size of your genome species.

An example of pls2fasta command I used : pls2fasta rawReadPB_s1_p0.2.bax.h5 rawReadPB_s1_p0.2.fasta -trimByRegion

Now everything is explained, let's get thing started :

  1. Create a .fofn file for each SMRT_cell. A fofn file (File Of Files Names) look like this :

/media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb1_s1_p0.1.bax.h5 /media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb2_s1_p0.2.bax.h5 /media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb3_s1_p0.3.bax.h5

I can't save you with an extra tip this time.. I did this by hand (maybe someone will give a tip in comment ?)

  1. For each .fofn file, launch bax2bam : a tool that will simply convert your raw bax file into a more convenient format. But be careful, theses BAM are not alignment yet. Here some bash scraps that may help you starting (sorry code insertion was not working for this):
for FOFN in "$FOFN_DIR"/* do
   echo "Dealing with file "$FOFN
   ./bax2bam -f $FOFN -o $BAM_DIR"/SMRT-cell-"$COMPT --subread  --pulsefeatures=DeletionQV,DeletionTag,InsertionQV,IPD,MergeQV,SubstitutionQV,PulseWidth,SubstitutionTag

This command will create several bam files : - .scraps.bam : contains all the sequences that were filtered out: low quality regions of the polymerase reads, adapter sequences (, barcodes, control sequences). These were stored (and labelled) in the bax files, and are basically “split” to a different BAM file when converting. - .subreads : your high-quality reads that you want to use for pbalign - .pbi : pacBio index

  1. For each PacBio subreads.bam , launch pbalign to align your reads against your reference. This can produce either a bam or a sam file. This time, the output is an alignment. You can launch pbalign as following :
for PB_BAM in "$BAM_DIR"/*"subreads.bam"
  echo "Dealing with file "$PB_BAM
  ./pbalign --nproc 32 $PB_BAM your/assembly.fasta $ALIGN_DIR"/align_"$COMPT".bam"

-> Extra tip : don't forget to adjust the number of cores to your own possibility

  1. Merge all the produced bam files using samtools merge -> Extra tip : you will probably have to sort all your bam files before merging them.
  2. Use arrow using your reference file and your merged bam. My advise is to personalize this step depending on your need. Several options are proposed, like a diploid option to allow more polymorphism, also a lot of read selection filtering options are available. So no copy/paste for this time ! -> Extra tip : You should have an eye on arrow help : arrow -h... Obvious Ostrich strikes again !

I really hope all theses informations could be useful to anyone. I really struggled to gather all theses informations. Took me almost 2 hours to write this tutorial. Of course, this is not the absolute way of using PacBio tools ! But if you are lost, starting with PacBio tools, you can use this as a start to your analysis...

I'm open to feed back, corrections, precisions, questions... I just wanted to help !



ADD COMMENTlink modified 16 days ago by zusman0 • written 4 weeks ago by Roxane Boyer380

Hi Roxane, thanks for this nice howto. That's what is missing in the PacBio docs and also in Canu docs.

You could use a dataset when aligning the raw reads back to the assembly.fasta, in this case you would only need one single "pbalign" call and there would be no need to merge BAM files.

dataset create --type SubreadSet --name myGenomeSet.xml *.subreads.bam

and then

pbalign --nproc 32 myGenomeSet.xml your/assembly.fasta myGenomeSet.bam
ADD REPLYlink written 23 days ago by sklages60

Thanks for that tip ! So when you create the dataset, you can specify the desired coverage for example ?

ADD REPLYlink written 22 days ago by Roxane Boyer380

No, I don't think so. With command "create" it simply creates a XML file with all relevant infos about the BAM files.

ADD REPLYlink written 22 days ago by sklages60

Thanks so much Roxane - that's really great and will be helpful for the future! I may be using PacBio.

ADD REPLYlink written 4 weeks ago by Kevin Blighe3.2k

You're welcome ! Glad it could help :)

ADD REPLYlink written 4 weeks ago by Roxane Boyer380

Dear Roxane,

Thanks for a detailed post on Pacbio. I had been working on Illumina data but now I have been assigned to work on Pacbio data. Since I am new to this, I have few queries in mind. Firstly: I have 3 bax.h5 and 1 bas.h5 files other than .csx ans .xml file. If i am not mistaken, bas.h5 file only contains the quality information for the reads. So, for analysis we require .bax.h5 files. Also each .bax.h5 files correspond to different sample. After installation of SMRT, the first step is polishing using bax2bam that will remove all low quality and adapter sequences from the reads. After aligning the subreads.bam file to reference, why do we need to merge all bam files. And you you mean merge all bam files coming from each .bax.h5.

I need to perform quasispecie analysis for my data. once i have bam file using pbalign can i directly use the bam file to reconstruct quaasispecies or there are other necessary steps to take before i do that.

Much Thanks, zusman

ADD REPLYlink written 16 days ago by zusman0

Dear Zusman,

If i am not mistaken, bas.h5 file only contains the quality information for the reads. So, for analysis we require .bax.h5 files. Also each .bax.h5 files correspond to different sample.

Everything you said is correct so far.

Also each .bax.h5 files correspond to different sample. After installation of SMRT, the first step is polishing using bax2bam that will remove all low quality and adapter sequences from the reads.

But here, I'm not sure you clearly understood the process. bax2bam is not yet the tool that will perform polishing. bax2bam allow you to convert the "weird" raw PacBio format into a more convenient format : bam (note this is not yet an alignement). So bax2bam does not remove any read, it just convert raw reads into bam reads.

After aligning the subreads.bam file to reference, why do we need to merge all bam files. And you you mean merge all bam files coming from each .bax.h5.

Not really sure about your question here. If you have follow my logic above (but once again, this is a way of doing it, not necessary THE way), you will end with a alignement bam file for each SMRT cell (3 bax.h5 file). You are merging the bam file because arrow (this one is the polishing tool), takes only one alignement file as an input.

I need to perform quasispecie analysis for my data. once i have bam file using pbalign can i directly use the bam file to reconstruct quaasispecies or there are other necessary steps to take before i do that.

I have absolutely no idea (yet) of what quaasispecies is. I think you should explain this to me a little bit further.

I'm not sure I answered your question. Tell me if I missed something in your questions.



ADD REPLYlink written 16 days ago by Roxane Boyer380

Can you guide me how i can process my data. After installation, i need to perform polishing and as you mentioned you used Arrow. Can you mention the command you used. So Arrow removes the low quality reads and adapters?? Once i have performed polishing using Arrow, i can use pbalign to align my each of filtered .bax.h5 files. Or is it that we dont need to perform filtering and directly align against reference. I am confuse how pacbio works. Quasispecies is set of different variants within a viral population. My aim to to check for every patient at different timepoints how viral population behaved. For that i have to evaluate each bax.h5 file and extract information of possible quasispecies and compare across different timepoints.

ADD REPLYlink written 16 days ago by zusman0

arrow works on aligned data. Maybe you should read [1] to get an idea what arrow is doing.

Best, Sven

[1] = https://github.com/PacificBiosciences/GenomicConsensus/blob/master/README.md

ADD REPLYlink written 15 days ago by sklages60
gravatar for bhagyathimmappa
27 days ago by
bhagyathimmappa0 wrote:

Hello, Thank you for the post. Do you have any idea about how to improve available assembly using PacBio reads?

ADD COMMENTlink written 27 days ago by bhagyathimmappa0

I never had to do such a thing, only tried de novo assembly. I would suggest you to read these post (if it's not already done), you can probably grab some useful answers : Gap-filling and scaffolding using PacBio reads Scaffolding with pacbio reads

ADD REPLYlink written 25 days ago by Roxane Boyer380
gravatar for s.kyungyong64
17 days ago by
Berkeley, USA
s.kyungyong640 wrote:

Thanks Roxane!

I am about to polish my contigs, and your post is really helpful! My pacbio reads have primer sequences at the both ends of reads ( ~ 60nt). So I ended up having to generate the CCS reads, trim the adapters and assemble it with Canu. But since I can't modify the sam file of subreads, I am afraid these adapters may ruin the alignment and generate errors in polishing. Do you have any idea about this?

ADD COMMENTlink written 17 days ago by s.kyungyong640

Hi Kyungyong !

So I'm not really used to PacBio dataset with adaptaters, but I saw an option in the bam2bam tools from pacbio that allow to specify adaptaters and barcodes sequence. Maybe it allow a kind of filter to put apart adapaters form the output bam ?

Here is the SMRT Tools Guide that may be really useful : http://www.pacb.com/wp-content/uploads/SMRT-Tools-Reference-Guide-v4.0.0.pdf

I also found an other pacbio tools called lima : https://github.com/PacificBiosciences/barcoding

To me, it look like something you can do. But maybe I'm wrong... Tell me if theses were useful to you !


ADD REPLYlink written 17 days ago by Roxane Boyer380
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1304 users visited in the last hour