I have a set of PacBio long-read data in *.bax.h5 format. I want to detect DNA base modifications (DNA methylation) using the pulse information in the PacBio data. I know the pulse information is stored in pls.h5 files, but I don't have those files. Is it possible to extract some or any pulse information from the *.bax.h5 files? Do *.bax.h5 files contain any pulse information at all?
1) You can create an unaligned .bam file with the pulse information using
path/to/blasr/utils/bax2bam/bin/bax2bam -o outputPrefix /path/to/file.1.bax.h5 /path/to/file.2.bax.h5 /path/to/file.3.bax.h5
By default, IPD information is added but PulseWidth is not. However, you can choose which features to add with --pulsefeatures. For example:
path/to/blasr/utils/bax2bam/bin/bax2bam -o outputPrefix /path/to/file.1.bax.h5 /path/to/file.2.bax.h5 /path/to/file.3.bax.h5 --pulsefeatures=DeletionQV,DeletionTag,InsertionQV,IPD,PulseWidth,MergeQV,SubstitutionQV,SubstitutionTag --losslessframes
You can read more about the PacBio .bam file format here: http://pacbiofileformats.readthedocs.io/en/3.0/BAM.html
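If you have many movies to convert, it may help to assemble the bax2bam command from whatever *.bax.h5 files sit in a directory and inspect it before running it. The following is only a sketch: the function name build_bax2bam_cmd and the BAX2BAM and PULSE_FEATURES variables are made up for illustration, and the flags are copied from the example above rather than checked against any particular bax2bam version.

```shell
#!/bin/sh
# Sketch: print (not run) a bax2bam command for all *.bax.h5 files in a directory.
# BAX2BAM, PULSE_FEATURES and build_bax2bam_cmd are illustrative names (assumptions).
BAX2BAM="${BAX2BAM:-path/to/blasr/utils/bax2bam/bin/bax2bam}"
# Feature list taken verbatim from the example command above.
PULSE_FEATURES="${PULSE_FEATURES:-DeletionQV,DeletionTag,InsertionQV,IPD,PulseWidth,MergeQV,SubstitutionQV,SubstitutionTag}"

build_bax2bam_cmd() {
    # $1 = directory holding the .bax.h5 files, $2 = output prefix
    dir=$1
    prefix=$2
    # Replace the positional parameters with the matching files (sorted by the shell).
    set -- "$dir"/*.bax.h5
    # If nothing matched, the pattern is left unexpanded and does not exist as a file.
    if [ ! -e "$1" ]; then
        echo "no .bax.h5 files found in $dir" >&2
        return 1
    fi
    # Printed for inspection only; paths containing spaces would need quoting.
    echo "$BAX2BAM -o $prefix $* --pulsefeatures=$PULSE_FEATURES --losslessframes"
}

# Usage (dry run):
#   build_bax2bam_cmd /path/to/movie_dir outputPrefix
```

Printing the command first makes it easy to confirm that all parts of a movie were picked up before any conversion starts.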
2) If you have an aligned cmp.h5 file (or a .sam alignment that you convert to a cmp.h5 file via samtoh5), you can use loadPulses to add base modification information:
loadPulses /path/to/file.bas.h5 /path/to/blasr.alignment.cmp.h5
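The loadPulses command above takes no separate output argument, so I assume it writes into the cmp.h5 file it is given. Under that assumption, a cautious wrapper could keep a copy of the alignment first. This is a minimal sketch: load_pulses_with_backup and the LOADPULSES variable are hypothetical names, and the only loadPulses invocation used is the two-argument form shown above.

```shell
#!/bin/sh
# Sketch: back up the cmp.h5, then call loadPulses exactly as in the command above.
# LOADPULSES and load_pulses_with_backup are illustrative names (assumptions).
LOADPULSES="${LOADPULSES:-loadPulses}"

load_pulses_with_backup() {
    # $1 = .bas.h5 file, $2 = aligned cmp.h5 file
    basfile=$1
    cmpfile=$2
    # Refuse to run if either input is missing.
    for f in "$basfile" "$cmpfile"; do
        if [ ! -f "$f" ]; then
            echo "missing input: $f" >&2
            return 1
        fi
    done
    # Keep an untouched copy in case the in-place update goes wrong.
    cp -p "$cmpfile" "$cmpfile.bak" || return 1
    "$LOADPULSES" "$basfile" "$cmpfile"
}

# Usage:
#   load_pulses_with_backup /path/to/file.bas.h5 /path/to/blasr.alignment.cmp.h5
```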
samtoh5 is part of the blasr package.
You can then use R-kinetics to parse and work with the base modification information in the alignment.
My understanding is that PacBio may not continue to maintain samtoh5 (as they switch to the .bam file format), but you can find it under /path/to/blasr/utils (if compiled). The same is true for loadPulses, unless you compile using pitchfork.