PacBio interpulse duration (IPD) data
4.6 years ago
bgbrink ▴ 60

I started working on my first project with PacBio sequencing data, and after two days of fruitless googling, missing libraries and failed C compilations, I decided it's time to ask for help.

The information on tools and pipelines for PacBio data is scattered everywhere, and most of it seems horribly out of date. I was hoping some of you could help me get started. Hopefully other lost souls will find this post in the future and save themselves some time and frustration.

The data:

I have a set of Primary Analysis Data available, as explained in the first paragraph here.

The tools:

After some struggle, I managed to compile the latest version of blasr and pbalign on Ubuntu 16.04 using pitchfork. I also managed to install the R packages h5r, pbh5, and seqPatch. If anyone is reading this and has trouble with R and HDF5 libraries under Ubuntu, see my question here.

What I don't have:

What I would like to do:

I would like to access the interpulse duration (IPD), as explained in this white paper, and preferably access this data in R. I am open to suggestions for other tools/programming languages as well though.

Problem:

I need cmp.h5 files to load the IPDs with the R packages. How do I generate those? When I try to run pbalign, it says "pbalign no longer supports CMP.H5 Output in 3.0". Is there any other way to get to the IPDs without going through cmp.h5 files?

Thank you very much for your time.


PacBio has a wiki dedicated to training for PacBio data, in case you have not discovered it.

Here is technical info about the h5 format PacBio uses and the tools they provide.


I did see the training, but it was not very helpful, since most of it is tailored to PacBio's SMRT Portal/SMRT Link platform (what's the difference anyway?). I will have another look though, since I also missed the Python script you mentioned. Thanks a lot for pointing that out!

4.6 years ago
tjduncan ▴ 270
• What organism is your data from?
• What is the genome size of the organism? What coverage do you have?
• Which PacBio instrument was your data generated on: RSII (output is a bax.h5 file) or Sequel (output is an unaligned .bam file)?

The instrument the data was generated on is important, as it determines which bioinformatics (BFX) tools you can use for analysis.
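As a rough illustration of the distinction above, here is a minimal shell sketch that guesses the instrument from the file extension of a raw-data file. The function name guess_instrument is made up for this example and is not a PacBio tool. It is only a heuristic: a .bam file produced by converting RSII data would also be reported as Sequel.

```shell
# Heuristic only: guess the PacBio instrument from a raw-data file name.
# guess_instrument is a hypothetical helper, not part of any PacBio software.
guess_instrument() {
    case "$1" in
        *.bax.h5) echo "RSII" ;;     # RSII writes bax.h5 files
        *.bam)    echo "Sequel" ;;   # Sequel writes unaligned .bam files
        *)        echo "unknown" ;;
    esac
}

# Classify every file name given as an argument to the script.
for f in "$@"; do
    printf '%s\t%s\n' "$f" "$(guess_instrument "$f")"
done
```

Saved as a script and called with a list of file names, it prints one tab-separated line per file.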

• SMRT Portal is designed to be used with RSII data and accepts raw bax.h5 files as input. It was last updated in Nov 2014, so unfortunately it is not actively maintained. You will be reduced to scrolling through outdated GitHub info and the interwebs for analysis help.

• SMRT Link is designed to be used with Sequel data and accepts an unaligned .bam file as input. It is somewhat backwards compatible with RSII data: there is a bax2bam command that converts RSII data to the same unaligned bam format as Sequel data. This is sufficient for most applications, but I don't think it works (correct me if I am wrong?) for base mod work, because the IPD info is not conserved upon file conversion.

I am not aware of a way to get IPD information without a cmp.h5 file. It will also likely be hard to avoid downloading and using either SMRT Portal or SMRT Link in one way or another. Luckily, both of them can be used relatively easily on a workstation (depending on the genome size of your organism). SMRT Portal can be run with its GUI on a workstation, and SMRT Link can be run in command-line-only mode (without having to set up a full SMRT Link server).

SMRT Portal

• Here is a Biostars link to instructions on how to install the command-line only tools. It is wayyyy better than the official instructions.
• Here is the SMRT Tools Reference Guide. It is an in-depth list of all the commands available in SMRT Link and their options. Check out pages 41-43 for MotifMaker.

• Check out pages 52-58 of the reference guide for pbsmrtpipe. This will allow you to run the whole ds_motif_modification_analysis pipeline, which generates the .csv file seen in the white paper. Specifically, page 58 shows the command that would run this pipeline (with a slight ID modification).

Once you have run a base mod pipeline in either SMRT Portal or SMRT Link (via pbsmrtpipe), you should have output .csv, .gff, and cmp.h5 files that you can do tertiary analysis on using whatever you want. There are also a few tools available from PacBio that run downstream of the initial analysis.
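As one example of such tertiary analysis, here is a small awk sketch that keeps only the rows of the .csv output whose IPD ratio is at or above a threshold. It assumes the file is comma-separated with a header row containing a column named ipdRatio. That column name is an assumption, so check the header of your own file. The function name filter_by_ipd_ratio is made up for this example.

```shell
# Sketch: print the header plus all rows whose ipdRatio is >= a minimum value.
# Assumption: comma-separated file with a header column literally named "ipdRatio".
filter_by_ipd_ratio() {
    # $1 = CSV file, $2 = minimum ipdRatio
    awk -F',' -v min="$2" '
        NR == 1 {
            # Locate the column by name instead of hard-coding its position.
            for (i = 1; i <= NF; i++) {
                h = $i
                gsub(/"/, "", h)
                if (h == "ipdRatio") col = i
            }
            if (!col) { print "no ipdRatio column found" > "/dev/stderr"; exit 1 }
            print
            next
        }
        ($col + 0) >= (min + 0) { print }
    ' "$1"
}
```

For instance, `filter_by_ipd_ratio modifications.csv 2.0 > candidates.csv` would write the candidate positions to a new file that can then be loaded into R. Both file names are placeholders.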

• PacBio Base Mod tools: this is the link to their GitHub of additional tools. It looks like only kineticsTools, MotifMaker, and MotifFinder have been updated for Sequel data.

There are also a handful of methods developed by other researchers that use IPD / base mod data from PacBio sequencing. You could look at some of the published papers to get additional ideas.


I have PacBio data generated by Sequel (.bam) from a plant genome. I want to analyze the methylome of this plant using AgIn. However, I noticed that AgIn requires the modification.csv file, which I do not have. Could you, @tjduncan, tell me how to generate this file?


Thanks, this was really helpful. As I mentioned in my original post, I have RSII data available. Thus, I could not use SMRT Link. However, I was able to run the old SMRT analysis on my laptop.

3.8 years ago
lr65358 ▴ 20

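You need an older version of pbalign. Install blasr, then, inside a local clone of the pbalign repository, check out the older commit and install it: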

conda install -c bioconda blasr

cd pbalign

git checkout 6c8618cfee963e2167100cb0b293aedf85f32dcf

sudo pip install .