Question: IPD data in PacBio cmp.h5 v2.0
mjp10 (USA) wrote, 2.7 years ago:

I'm trying to extract IPD values for kinetics analysis from publicly available data (3 x bax.h5, bas.h5). However, when I use the R-kinetics scripts, I am told that no IPD data is included, even though I explicitly request it in my preparation steps and get no warning that it is not carried along.

The pipeline I'm using so far is:

bax2bam -o PREFIX file.1.bax.h5 file.2.bax.h5 file.3.bax.h5 --subread --pulsefeatures=DeletionQV,DeletionTag,InsertionQV,IPD,PulseWidth,MergeQV,SubstitutionQV,SubstitutionTag --losslessframes

blasr PREFIX.subreads.bam refGenome.fa --out file.bam
samtools sort file.bam file_sorted
samtools view -h -o file.sam file_sorted.bam
samtoh5 file.sam refGenome.fa file.cmp.h5
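
Because this pipeline passes through SAM, one quick sanity check is whether the per-base kinetics tags survive into that file at all. The sketch below is illustrative only. It assumes the PacBio BAM convention of storing IPD in an "ip" tag and pulse width in a "pw" tag, and the helper names (kinetics_tags, count_records_with_ipd) are hypothetical rather than part of any PacBio tool.

```python
def kinetics_tags(sam_line):
    """Return the kinetics tag names ('ip', 'pw') present on one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    found = set()
    # The first 11 columns are the mandatory SAM fields; optional tags follow.
    for tag in fields[11:]:
        name = tag.split(":", 1)[0]
        if name in ("ip", "pw"):
            found.add(name)
    return found


def count_records_with_ipd(sam_lines):
    """Return (total records, records carrying an 'ip' tag)."""
    total = 0
    with_ipd = 0
    for line in sam_lines:
        # Skip blank lines and '@' header lines.
        if not line.strip() or line.startswith("@"):
            continue
        total += 1
        if "ip" in kinetics_tags(line):
            with_ipd += 1
    return total, with_ipd
```

Passing an open handle on file.sam to count_records_with_ipd gives the two counts. If the second count is zero, the kinetics were already lost before samtoh5 ran, so the resulting cmp.h5 cannot contain them.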

The version of programs I'm using:

bax2bam = v0.0.8

blasr = v.5.2

samtoh5 = v1.0.0.141782

After all this I get a cmp.h5 that I am able to read in and view, but there are no IPD details inside it. I also tried, independently, to load the pulse data on top of the cmp.h5 with loadPulses, but I get this message:

$ loadPulses movie_s1_p0.bas.h5 file.cmp.h5
[INFO] 2016-07-28T14:59:34 [loadPulses] started.
WARNING: There is insufficient data to compute metric: ClassifierQV in the file movie_s1_p0.1.bax.h5  It will be ignored.
WARNING: There is insufficient data to compute metric: pkmid in the file movie_s1_p0.1.bax.h5  It will be ignored.
loading 82011 alignments for movie 1
ERROR, the query sequence does not match the aligned query sequence.
HoleNumber: 32516, MovieName: movie_s1_p0, ReadIndex: 32516, qStart: 4404, qEnd: 9220
Aligned sequence: 
TTGAAAGAAAA....
Original sequence: 
GGTAAAACATT....
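
The error above says that the bases recorded for the alignment differ from the bases found in the movie file at the stated coordinates. The sketch below shows a consistency check of that kind. It is not the actual loadPulses code. It assumes that qStart and qEnd index into the full read, that gaps are written as '-', and that a reverse-strand hit may be stored reverse-complemented. The function names are hypothetical.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def aligned_query_matches(aligned_query, original_read, q_start, q_end):
    """Check that the gap-stripped aligned query equals original_read[q_start:q_end].

    The comparison is case-insensitive and also accepts the reverse
    complement of the expected slice (an assumption about strand handling).
    """
    ungapped = aligned_query.replace("-", "").upper()
    expected = original_read[q_start:q_end].upper()
    return ungapped == expected or ungapped == reverse_complement(expected)
```

A False result corresponds to the situation loadPulses reports. One way this could happen, offered here only as a possibility, is that the coordinates are interpreted against a different read or a different coordinate system than the one the aligner saw.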

Now, there has to be something simple that I am missing. Any assistance would be appreciated.

Tags: pacbio, methylation
rhall160 (United States) wrote, 2.7 years ago:

That workflow isn't going to work. To align and keep IPD values you will have to use either a 2.3-style workflow (bax.h5 -> pbalign -> cmp.h5 -> ipdSummary on the cmp.h5 file) or a 3.x workflow (bax2bam -> pbalign -> bam -> ipdSummary on the bam file).
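
Whichever route is used, it can be useful to confirm that a cmp.h5 actually carries kinetics before running downstream tools. The sketch below assumes the usual cmp.h5 layout, in which each alignment group holds an AlnArray dataset and, when pulse metrics were loaded, a sibling IPD dataset. Reading the file relies on the third-party h5py package, and the helper names are hypothetical.

```python
import posixpath


def list_dataset_paths(cmp_h5_path):
    """Collect the full path of every dataset in an HDF5 file (needs h5py)."""
    import h5py  # third-party; imported lazily so the check below works without it

    paths = []

    def visit(name, obj):
        # visititems passes names relative to the root, without a leading slash.
        if isinstance(obj, h5py.Dataset):
            paths.append("/" + name)

    with h5py.File(cmp_h5_path, "r") as handle:
        handle.visititems(visit)
    return paths


def groups_missing_ipd(dataset_paths):
    """Return alignment groups that have an AlnArray dataset but no IPD dataset."""
    by_group = {}
    for path in dataset_paths:
        group, name = posixpath.split(path)
        by_group.setdefault(group, set()).add(name)
    return sorted(
        group
        for group, names in by_group.items()
        if "AlnArray" in names and "IPD" not in names
    )
```

Calling groups_missing_ipd(list_dataset_paths("file.cmp.h5")) returns an empty list when every alignment group has an IPD dataset. Any group it lists has alignments but no kinetics.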


Thank you. That gives me some hope. I see in both scenarios you recommended pbalign. My version is 3.0.

Since I use public data, obtaining the files in a different version is not really possible, so I guess I will go with the second option.

Could you point me to where I can download v3.x bax2bam, as pitchfork only installs v0.0.8? I'm planning to run it on a cluster so I do everything locally. Thanks in advance.

written 2.7 years ago by mjp10

That 3.x reference may be to SMRTportal v.3.x. AFAIK v.3.x is only available to Sequel owners.

written 2.7 years ago by genomax65k

Hmm, I see. That does clarify some things. Thank you for bringing this up!

I guess I need to figure out a different workflow as my lab does not own a Sequel. What a pity that I'm not going to be able to test the software. I guess using public data is very limited then if you don't own an instrument (?).

I wonder how reviewers of papers using PacBio data are dealing with that matter...? No instrument -> no access to the latest software -> not able to reproduce the analysis.

written 2.7 years ago by mjp10

There is no public Sequel data as far as I know. Many are waiting. I think the new software is only needed for Sequel so any data you have access to should be analyzable by SMRTportal v.2.3.x (in theory).

written 2.7 years ago by genomax65k

Thanks. As we speak I actually am running something in a more automated way. I tried to break the pipeline into steps to be able to tweak it to my needs and to better understand it but it seems it is next to impossible.

written 2.7 years ago by mjp10

While I'm waiting for news on where to find the most recent software for analyzing bacterial methylomes, I wanted to add a couple of things that will hopefully help others.

I was unsuccessful in using pbalign (v3.0) in conjunction with blasr (v5.2) to produce a cmp.h5 file, as I didn't have a bam file as input. Using the same executables, I was also unable to generate a sam file (that output is deprecated) that I could pass to samtoh5 (also soon to be deprecated). The message said to produce bam and convert it to sam, which is what I tried before posting my original question.

It therefore seems that blasr (v5.2) is aimed only at users who have a bam file as input.

ipdSummary v2.2 cannot handle the bam file that came out of bax2bam (v0.0.8). The help message states that a cmp.h5 file is required as input, and that is indeed the error I see when I run it on the bam file produced by bax2bam (v0.0.8).

I'm not sure whether that will be possible even with a newer version of ipdSummary. I will be happy to check and report back once it is made available to me.

written 2.7 years ago by mjp10