Blog: Want to use fastq-dump to download PacBio data? Read this first
3 months ago by
pmarijon110 wrote:

TL;DR: I think not; prefer dextractor.

Recently I downloaded the NCTC11131 data set stored at EBI via the fastq-dump tool. I wanted to assemble it with miniasm and minimap in order to reproduce results from the HINGE paper.

The resulting assembly was very fragmented, with 34 contigs, whereas HINGE's authors obtained only 2 contigs with the same dataset and software.

After discussing with HINGE's authors, we confirmed that we used the same versions of minimap and miniasm, but they downloaded the bas.h5 files directly and then used dextractor to extract the reads.

When I use the same procedure as HINGE's authors to get the reads, I get a similar assembly.

Thus the difference is in the way of getting the reads.

With fastq-dump we have:

Total sequences: 162577
Total length: 1366.615824 Mb
Longest sequence: 80.922 kb
Shortest sequence: 3 b
Mean Length: 8.405 kb
Median Length: 4.314 kb
N10: 3615 sequences; L10: 32.743 kb
N50: 25330 sequences; L50: 20.172 kb
N90: 71907 sequences; L90: 5.781 kb

With dextractor we have:

Total sequences: 167025
Total length: 769.911448 Mb
Longest sequence: 32.006 kb
Shortest sequence: 500 b
Mean Length: 4.609 kb
Median Length: 3.497 kb
N10: 4379 sequences; L10: 14.311 kb
N50: 39953 sequences; L50: 5.903 kb
N90: 121202 sequences; L90: 2.48 kb

fastq-dump yields almost twice as many bases as dextractor (1366.6 Mb vs. 769.9 Mb) in slightly fewer reads, and its reads are much longer.
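For anyone who wants to compute the same kind of summary on their own reads, here is a minimal Python sketch. It is not the tool that produced the listings above; the function name `length_stats` and the dictionary keys are my own, and the keys are spelled out in words because the N50/L50 labels are used differently from one tool to another.

```python
from statistics import mean, median


def length_stats(lengths, fraction=0.5):
    """Summarise a list of read lengths (in bases).

    `fraction` is the share of total bases to reach: 0.5 gives the
    N50-style values, 0.1 and 0.9 the N10- and N90-style values.
    """
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths, reverse=True)  # longest reads first
    total = sum(ordered)
    threshold = total * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            break  # `count` reads of at least `length` bases cover the fraction
    return {
        "total_sequences": len(ordered),
        "total_length": total,
        "longest": ordered[0],
        "shortest": ordered[-1],
        "mean": mean(ordered),
        "median": median(ordered),
        "reads_to_reach_fraction": count,
        "length_at_fraction": length,
    }


if __name__ == "__main__":
    # Toy lengths, not taken from the NCTC11131 data set.
    for key, value in length_stats([2, 3, 4, 5, 6]).items():
        print(f"{key}: {value}")
```

Calling it with `fraction=0.1` or `fraction=0.9` gives the other two rows of the listings.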

When mapping dextractor reads against fastq-dump ones, we realized that dextractor reads are often contained in fastq-dump reads.

fastq-dump id: ERR972361.54 read length: 9565
    begin   end dextractor_id
    3216    7116    m150526_220338_00127_c1008…1823177310081531_s1_p0/53/3210_7121
    7171    8309    m150526_220338_00127_c1008…1823177310081531_s1_p0/53/7167_8310
fastq-dump id: ERR972361.161 read length: 18592
    begin   end dextractor_id
    0   11058   m150526_220338_00127_c1008…1823177310081531_s1_p0/160/0_11058
    3784    11030   m150526_220338_00127_c1008…1823177310081531_s1_p0/160/11102_18592
    11104   18589   m150526_220338_00127_c1008…1823177310081531_s1_p0/160/11102_18592
fastq-dump id: ERR972361.192 read length: 23763
    begin   end dextractor_id
    86  6226    m150526_220338_00127_c10080…1823177310081531_s1_p0/191/6351_12100
    1733    6303    m150526_220338_00127_c1008…1823177310081531_s1_p0/191/1729_6304
    6350    12097   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/6351_12100
    6434    11362   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/18486_23763
    12146   18443   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/12145_18442
    18488   23762   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/18486_23763
fastq-dump id: ERR972361.196 read length: 8567
    begin   end dextractor_id
    4380    5912    m150526_220338_00127_c1008…1823177310081531_s1_p0/195/4380_5915
    5963    8376    m150526_220338_00127_c1008…1823177310081531_s1_p0/195/5961_8376
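The dextractor read names appear to end in `/hole/start_end`, and in the hits above many begin/end positions fall within a few bases of those coordinates. A small Python sketch of that check follows, assuming this naming layout holds for every read. The movie name `m_example_s1_p0`, the function names, and the 10-base tolerance are placeholders of mine, not something taken from either tool.

```python
def parse_subread_id(read_id):
    """Split a 'movie/hole/start_end' read name into its parts."""
    movie, hole, span = read_id.rsplit("/", 2)
    start, end = span.split("_")
    return movie, int(hole), int(start), int(end)


def matches_own_coordinates(hit_begin, hit_end, read_id, tolerance=10):
    """True if a hit on the fastq-dump read lies where the read name says."""
    _, _, start, end = parse_subread_id(read_id)
    return (abs(hit_begin - start) <= tolerance
            and abs(hit_end - end) <= tolerance)


if __name__ == "__main__":
    # Coordinates copied from the hits listed above; movie name is a placeholder.
    hits = [
        (3216, 7116, "m_example_s1_p0/53/3210_7121"),
        (7171, 8309, "m_example_s1_p0/53/7167_8310"),
        (3784, 11030, "m_example_s1_p0/160/11102_18592"),
    ]
    for begin, end, read_id in hits:
        print(read_id, matches_own_coordinates(begin, end, read_id))
```

Hits that fail the check, like the third one, are dextractor reads that also map to a different part of the same fastq-dump read.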

So a question arises: does fastq-dump fail to extract PacBio subreads (or extract them badly) and output only raw reads?

I didn't find any information recommending against using fastq-dump for PacBio data sets, but I may have missed something. Rob Edwards has already pointed out that fastq-dump is not well documented.

In any case, I think it's safer not to use fastq-dump for extracting PacBio reads, and I would recommend using dextractor.

Versions of the tools used:

  • minimap 0.2-r123
  • miniasm 0.2-r128
  • fastq-dump 2.8.2
  • dextractor 1.0p2
ADD COMMENTlink modified 9 weeks ago by shengweima30 • written 3 months ago by pmarijon110

Hi pmarijon,

Although I suspect it's not intended as such, this post looks a lot like a question, perhaps partly because of the question in the title. Perhaps you could change that to make it clearer that you are reporting your findings rather than asking a question.


ADD REPLYlink written 3 months ago by WouterDeCoster30k

Hi WouterDeCoster,

Thanks, I changed the title. I hope it's clearer now? (Now this is a question :) )


ADD REPLYlink modified 3 months ago • written 3 months ago by pmarijon110

I was thinking more of something like "Don't use fastq-dump for PacBio data", to be explicit and not look like a click-bait article, but okay :-)

ADD REPLYlink written 3 months ago by WouterDeCoster30k

"You won't believe what happened to this PhD student after he ran fastq-dump"

ADD REPLYlink modified 3 months ago • written 3 months ago by Rayan Chikhi1.3k

"You'd be surprised by the incredible story of this simple student when he uses the wrong tools to get reads"

ADD REPLYlink written 3 months ago by pmarijon110

Question being: how would one know that fastq-dump is the wrong tool in this case?

ADD REPLYlink written 3 months ago by bastien.chevreux0


Sorry I don't understand your question.

ADD REPLYlink written 9 weeks ago by pmarijon110

I think Bastien wants to know why we think the output of fastq-dump is "wrong". As I recall, it has to do with how the raw .bax.h5 PacBio files get converted to FASTQ. There seem to be additional steps needed to split the raw reads and avoid chimeric reads. dextractor performs them, but fastq-dump, in this case, doesn't.
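To make the splitting step concrete, here is a rough Python sketch of cutting one raw read into subreads once the insert coordinates are known. This is not dextractor's actual code: the function name, the `hypothetical_movie/53` read name, and the synthetic sequence are made up for illustration, and the 500-base minimum is only an assumption suggested by the shortest dextractor read in the statistics of the original post.

```python
def split_raw_read(name, sequence, regions, min_length=500):
    """Cut a raw read into subreads.

    `regions` is a list of (start, end) pairs, 0-based with `end`
    exclusive, marking the insert stretches between adapters.
    Subreads shorter than `min_length` are dropped.
    """
    subreads = []
    for start, end in sorted(regions):
        if not 0 <= start < end <= len(sequence):
            raise ValueError(f"region {start}_{end} falls outside the read")
        if end - start < min_length:
            continue  # too short to keep
        subreads.append((f"{name}/{start}_{end}", sequence[start:end]))
    return subreads


if __name__ == "__main__":
    # Synthetic 9565-base read; the coordinates mirror the two dextractor
    # reads reported for ERR972361.54 in the original post.
    raw = ("ACGT" * 2392)[:9565]
    regions = [(3210, 7121), (7167, 8310)]
    for read_id, seq in split_raw_read("hypothetical_movie/53", raw, regions):
        print(read_id, len(seq))
```

Everything outside the listed regions (adapters and low-quality stretches) is discarded, which would be consistent with dextractor reporting far fewer bases than fastq-dump.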

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by Rayan Chikhi1.3k

or comparison between fastq-dump and dextractor for downloading and processing pacbio data.

ADD REPLYlink modified 3 months ago • written 3 months ago by cpad01127.6k
9 weeks ago by
shengweima30 wrote:

Can you share the detailed dextractor command? How did you filter the raw files?

ADD COMMENTlink written 9 weeks ago by shengweima30

I just ran dextract on the bax.h5 files without any options.

dextract *.bax.h5

I didn't apply any filtering to the raw files.

ADD REPLYlink written 9 weeks ago by pmarijon110