Question: PacBio: extract FASTQ from .h5 files based on quality filtering
merodev wrote:

Hi, I am new to PacBio and have two sets of .h5 files as output from a PacBio run. I am planning to use the Celera Assembler, and for that I need FASTQ files generated from the .h5 files.

1) Is there any way to convert .h5 to FASTQ?

2) Is there a specific method to filter PacBio reads based on quality?

3) Should both sets of data be combined before running the assembly?

Thanks!

modified 2.4 years ago by mehmetgoktay1989 • written 4.1 years ago by merodev

The quality values are low enough that Celera may artificially trim the reads. I've found it's best to just fake a FASTQ from the FASTA, with quality values high enough that the reads are retained. The assembly quality then needs to be improved afterwards using Quiver.
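
A minimal sketch of faking a FASTQ from a FASTA, as described above. The function name fasta_to_fake_fastq and the default quality character "I" (Phred 40, assuming Phred+33 encoding) are illustrative assumptions, not something prescribed by Celera or PacBio. Multi-line FASTA records are joined into a single sequence line.

```shell
# fasta_to_fake_fastq [QUAL_CHAR] < in.fasta > out.fastq
# Converts each FASTA record into a 4-line FASTQ record whose quality
# string repeats a single character. The default "I" is an assumed
# placeholder (Phred 40 under Phred+33). Avoid "&" and "\" as QUAL_CHAR,
# since awk treats them specially in gsub replacements.
fasta_to_fake_fastq() {
    awk -v q="${1:-I}" '
        # Print the buffered record, if there is one, as four FASTQ lines.
        function emit(   qual) {
            if (!have) return
            qual = seq
            gsub(/./, q, qual)   # one quality character per base
            printf "@%s\n%s\n+\n%s\n", name, seq, qual
        }
        /^>/ { emit(); name = substr($0, 2); seq = ""; have = 1; next }
             { seq = seq $0 }    # join wrapped sequence lines
        END  { emit() }
    '
}

# Example (file names are hypothetical):
#   fasta_to_fake_fastq < subreads.fasta > subreads.fake.fastq
```

Since the qualities carry no real information here, any downstream step that weighs bases by quality will treat all bases alike.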

written 4.1 years ago by mchaisso
Biomonika (Noolean) wrote:

For 1) and 2), use bash5tools.py (input.bas.h5 stands for your own input file):

bash5tools.py --minLength 500 --readType subreads --minReadScore 0.8 --outType fastq input.bas.h5

For 3), it depends on your dataset, but if you just sequenced two SMRT cells to get more coverage, you can merge them prior to assembly.

https://github.com/PacificBiosciences/pbh5tools/blob/master/doc/index.rst
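
Merging the two SMRT cells can be as simple as concatenating the per-cell FASTQ files once each has been extracted. A rough sketch follows; the function names and file names are hypothetical, and the read count assumes unwrapped FASTQ with exactly four lines per record.

```shell
# merge_fastq OUT IN1 [IN2 ...]
# Concatenates per-cell FASTQ files into a single file for the assembler.
merge_fastq() {
    out=$1
    shift
    cat "$@" > "$out"
}

# count_reads FILE
# Prints the number of records, assuming 4 lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Example (file names are hypothetical):
#   merge_fastq merged.fastq cell1.fastq cell2.fastq
#   count_reads merged.fastq
```

Comparing count_reads on the merged file against the sum over the inputs is a cheap sanity check that nothing was truncated.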

modified 4.1 years ago • written 4.1 years ago by Biomonika (Noolean)
thackl wrote:

Have a look at dextract. It's very quick and lets you set a score cutoff. However, I think it only generates FASTA.

https://dazzlerblog.wordpress.com/2014/03/22/the-dextractor-module-save-disk-space-for-your-pacbio-projects/

written 4.1 years ago by thackl

dextract can generate FASTQ if you add the -q parameter. To keep only reads with a minimum Read Quality of 0.80, use -s800 (default: 750):

dextract -q *.bax.h5 -s800 > raw_reads_RQ0.80.fastq

modified 4.0 years ago • written 4.0 years ago by rtliu

How can I combine it with the find command? For example:

find All_RawData/Each_Cell_Raw/ -name "*.bax.h5" | xargs -I {} dextract -q {} >

How do I get the file name to use for the output?

modified 20 months ago • written 20 months ago by Ric

https://www.everythingcli.org/find-exec-vs-find-xargs/
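
A redirect after xargs applies to the whole pipeline, so every file would end up in one output. One way to get a separate output name per input is to loop over the find results and derive the name from each path. This is only a sketch: it assumes, as in the dextract command earlier in this thread, that dextract -q writes FASTQ to standard output, and the function names and the "name.fastq" naming scheme are made up for illustration. It also assumes no file name contains a newline.

```shell
# fastq_name_for PATH
# Maps ".../name.bax.h5" to "name.fastq" (naming scheme is an assumption).
fastq_name_for() {
    base=${1##*/}                        # strip the directory part
    printf '%s\n' "${base%.bax.h5}.fastq"
}

# extract_each DIR
# Runs dextract once per .bax.h5 file under DIR and writes one FASTQ
# per input file into the current directory.
extract_each() {
    find "$1" -name '*.bax.h5' | while IFS= read -r f; do
        # stdin is redirected so dextract cannot consume the file list
        dextract -q "$f" < /dev/null > "$(fastq_name_for "$f")"
    done
}

# Example:
#   extract_each All_RawData/Each_Cell_Raw/
```

The same name derivation also works inside find -exec sh -c '...' sh {} + if you would rather avoid the pipe.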

written 20 months ago by h.mon
Jean-Karim Heriche wrote:

I've written a Perl wrapper for the HDF5 library that you might find useful. It hasn't been tested on PacBio files, though I have no reason to think it wouldn't read them.

written 4.1 years ago by Jean-Karim Heriche
mehmetgoktay1989 wrote:

Could you please tell me whether bash5tools.py also removes adapter sequences?

I have just used it and got subreads from the raw data, but I am not sure whether the subreads still contain adapter sequences.

written 2.4 years ago by mehmetgoktay1989

You need to post this as a separate question. Please also refer to the manual.

written 2.4 years ago by Rohit