TL;DR: I think not; prefer dextractor.
Recently I downloaded the NCTC11131 data set stored at EBI using the fastq-dump tool. I wanted to assemble it with miniasm and minimap in order to reproduce results from the HINGE paper.
The resulting assembly was very fragmented, with 34 contigs, while HINGE's authors obtained only 2 contigs with the same data set and software.
After discussion with HINGE's author, we confirmed that we use the same versions of minimap and miniasm, but they downloaded the bas.h5 files directly and then used dextractor to extract the reads.
When I use the same procedure as HINGE's authors to get the reads, I get a similar assembly.
Thus the difference lies in the way the reads are obtained.
With fastq-dump we have:
- Total sequences: 162577
- Total length: 1366.615824 Mb
- Longest sequence: 80.922 kb
- Shortest sequence: 3 b
- Mean Length: 8.405 kb
- Median Length: 4.314 kb
- N10: 3615 sequences; L10: 32.743 kb
- N50: 25330 sequences; L50: 20.172 kb
- N90: 71907 sequences; L90: 5.781 kb
With dextractor we have:
- Total sequences: 167025
- Total length: 769.911448 Mb
- Longest sequence: 32.006 kb
- Shortest sequence: 500 b
- Mean Length: 4.609 kb
- Median Length: 3.497 kb
- N10: 4379 sequences; L10: 14.311 kb
- N50: 39953 sequences; L50: 5.903 kb
- N90: 121202 sequences; L90: 2.48 kb
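fastq-dump extracts almost twice as many bases as dextractor (1366.6 Mb vs 769.9 Mb) in slightly fewer reads (162577 vs 167025), so its reads are much longer (mean length 8.405 kb vs 4.609 kb).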
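For anyone who wants to run the same comparison on their own extraction, here is a minimal Python sketch of how such length statistics can be computed. It is not the tool that produced the numbers above; the function names and dictionary keys are my own, and it assumes plain 4-line FASTQ records. Note that the output above prints a read count after "N50" and a length after "L50", which is the reverse of the more common convention, so the sketch uses neutral key names.

```python
import statistics


def fastq_lengths(lines):
    """Yield read lengths from an iterable of FASTQ lines.

    Assumes strict 4-line records (header, sequence, '+', qualities).
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the sequence is the second line of each record
            yield len(line.rstrip("\r\n"))


def length_summary(lengths):
    """Return basic length statistics for a non-empty list of read lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    summary = {
        "count": len(ordered),
        "total": total,
        "longest": ordered[0],
        "shortest": ordered[-1],
        "mean": total / len(ordered),
        "median": statistics.median(ordered),
    }
    for pct in (10, 50, 90):
        running = 0
        # Walk from the longest read down until pct% of all bases are covered.
        for n, length in enumerate(ordered, start=1):
            running += length
            if running * 100 >= total * pct:
                summary[f"n_reads_{pct}"] = n      # reads needed to reach pct%
                summary[f"length_{pct}"] = length  # length of the last read added
                break
    return summary


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python length_stats.py reads.fastq
    with open(sys.argv[1]) as handle:
        for key, value in length_summary(list(fastq_lengths(handle))).items():
            print(f"{key}: {value}")
```

Running it on both the fastq-dump and the dextractor output should make the kind of difference shown above visible at a glance.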
When we mapped the dextractor reads against the fastq-dump reads, we found that dextractor reads are often contained in fastq-dump reads:
fastq-dump id: ERR972361.54 read length: 9565

    begin  end    dextractor_id
    3216   7116   m150526_220338_00127_c1008…1823177310081531_s1_p0/53/3210_7121
    7171   8309   m150526_220338_00127_c1008…1823177310081531_s1_p0/53/7167_8310

fastq-dump id: ERR972361.161 read length: 18592

    begin  end    dextractor_id
    0      11058  m150526_220338_00127_c1008…1823177310081531_s1_p0/160/0_11058
    3784   11030  m150526_220338_00127_c1008…1823177310081531_s1_p0/160/11102_18592
    11104  18589  m150526_220338_00127_c1008…1823177310081531_s1_p0/160/11102_18592

fastq-dump id: ERR972361.192 read length: 23763

    begin  end    dextractor_id
    86     6226   m150526_220338_00127_c10080…1823177310081531_s1_p0/191/6351_12100
    1733   6303   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/1729_6304
    6350   12097  m150526_220338_00127_c1008…1823177310081531_s1_p0/191/6351_12100
    6434   11362  m150526_220338_00127_c1008…1823177310081531_s1_p0/191/18486_23763
    12146  18443  m150526_220338_00127_c1008…1823177310081531_s1_p0/191/12145_18442
    18488  23762  m150526_220338_00127_c1008…1823177310081531_s1_p0/191/18486_23763

fastq-dump id: ERR972361.196 read length: 8567

    begin  end    dextractor_id
    4380   5912   m150526_220338_00127_c1008…1823177310081531_s1_p0/195/4380_5915
    5963   8376   m150526_220338_00127_c1008…1823177310081531_s1_p0/195/5961_8376
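The dextractor read names follow the usual PacBio subread pattern `<movie>/<hole>/<start>_<end>`, and in the four examples above the hole number is always the fastq-dump spot number minus one. The following Python sketch checks whether a subread maps at the position encoded in its own name, which is what one would expect if the fastq-dump read is the whole raw read of that hole. The function names and the 10-base tolerance are my own assumptions, and the movie name is kept truncated exactly as in the listing above.

```python
def parse_subread_id(read_id):
    """Split '<movie>/<hole>/<start>_<end>' into (movie, hole, start, end)."""
    movie, hole, span = read_id.rsplit("/", 2)
    start, end = span.split("_")
    return movie, int(hole), int(start), int(end)


def maps_in_place(begin, end, read_id, tolerance=10):
    """True if the mapped interval agrees with the coordinates in the name.

    `tolerance` (in bases) is an arbitrary allowance for ragged alignment ends.
    """
    _, _, start, stop = parse_subread_id(read_id)
    return abs(begin - start) <= tolerance and abs(end - stop) <= tolerance


# Hits on ERR972361.161, copied from the listing above
# (movie name truncated with "…" as in the original post).
MOVIE = "m150526_220338_00127_c1008…1823177310081531_s1_p0"
hits = [
    (0, 11058, f"{MOVIE}/160/0_11058"),
    (3784, 11030, f"{MOVIE}/160/11102_18592"),
    (11104, 18589, f"{MOVIE}/160/11102_18592"),
]

for begin, end, read_id in hits:
    label = "in place" if maps_in_place(begin, end, read_id) else "elsewhere"
    print(begin, end, read_id.rsplit("/", 2)[-1], label)
```

Hits flagged "elsewhere" are subreads that also align to another part of the same fastq-dump read, which would be consistent with the fastq-dump read containing several passes over the same insert.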
So a question arises: does fastq-dump fail to extract PacBio subreads (or extract them badly) and output only raw reads?
I didn't find any information recommending against fastq-dump for PacBio data sets, but I may have missed something. Rob Edwards has already pointed out that fastq-dump is not well documented: https://edwards.sdsu.edu/research/fastq-dump/
In any case, I think it's safer not to use fastq-dump for extracting PacBio reads, and I would recommend using dextractor instead.
Versions of the tools used:
- minimap 0.2-r123
- miniasm 0.2-r128
- fastq-dump 2.8.2
- dextractor 1.0p2