TL;DR: I think not; prefer dextractor.
Recently I downloaded the NCTC11131 data set stored at EBI via the fastq-dump tool. I wanted to assemble it with miniasm and minimap in order to reproduce results from the HINGE paper.
The resulting assembly was very fragmented, with 34 contigs, while HINGE's authors obtained only 2 contigs with the same data set and software.
After discussion with HINGE's authors, we confirmed that we use the same versions of minimap and miniasm, but they downloaded the bas.h5 files directly and then used dextractor to extract the reads.
When I use the same procedure as HINGE's authors to get the reads, I get a similar assembly.
Thus the difference lies in how the reads are obtained.
With fastq-dump we have:
Total sequences: 162577
Total length: 1366.615824 Mb
Longest sequence: 80.922 kb
Shortest sequence: 3 b
Mean Length: 8.405 kb
Median Length: 4.314 kb
N10: 3615 sequences; L10: 32.743 kb
N50: 25330 sequences; L50: 20.172 kb
N90: 71907 sequences; L90: 5.781 kb
With dextractor we have:
Total sequences: 167025
Total length: 769.911448 Mb
Longest sequence: 32.006 kb
Shortest sequence: 500 b
Mean Length: 4.609 kb
Median Length: 3.497 kb
N10: 4379 sequences; L10: 14.311 kb
N50: 39953 sequences; L50: 5.903 kb
N90: 121202 sequences; L90: 2.48 kb
fastq-dump extracts more bases (about 1367 Mb versus 770 Mb) and its reads are longer (mean length 8.4 kb versus 4.6 kb) than those from dextractor.
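For anyone who wants to recompute this kind of summary on their own files, here is a minimal Python sketch. It is not the tool that produced the tables above. The function names `fastq_lengths`, `nx` and `summarize` are made up for illustration, and the parser assumes a plain FASTQ with exactly four lines per record. It follows the convention of the tables above, where each N10/N50/N90 line pairs a number of sequences with a length.

```python
import io
from statistics import mean, median


def fastq_lengths(handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # second line of each record is the sequence
            yield len(line.strip())


def nx(lengths, fraction):
    """Return (count, length) for the longest reads covering `fraction` of all bases.

    count  = how many of the longest reads are needed
    length = length of the shortest read among them
    """
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= total * fraction:
            return count, length
    raise ValueError("no reads given")


def summarize(lengths):
    """Collect the basic statistics shown in the tables above."""
    lengths = list(lengths)
    return {
        "total_sequences": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths),
        "shortest": min(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "n50": nx(lengths, 0.5),
    }


if __name__ == "__main__":
    # Toy two-record FASTQ, only to show the call pattern.
    toy = io.StringIO("@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGT\n+\nIIII\n")
    print(summarize(fastq_lengths(toy)))
```

Running the same function on both the fastq-dump and the dextractor output gives a quick way to spot the kind of discrepancy described here before launching an assembly.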
When mapping dextractor reads against fastq-dump ones, we realized that dextractor reads are often contained in fastq-dump reads.
fastq-dump id: ERR972361.54 read length: 9565
begin end dextractor_id
3216 7116 m150526_220338_00127_c1008…1823177310081531_s1_p0/53/3210_7121
7171 8309 m150526_220338_00127_c1008…1823177310081531_s1_p0/53/7167_8310
fastq-dump id: ERR972361.161 read length: 18592
begin end dextractor_id
0 11058 m150526_220338_00127_c1008…1823177310081531_s1_p0/160/0_11058
3784 11030 m150526_220338_00127_c1008…1823177310081531_s1_p0/160/11102_18592
11104 18589 m150526_220338_00127_c1008…1823177310081531_s1_p0/160/11102_18592
fastq-dump id: ERR972361.192 read length: 23763
begin end dextractor_id
86 6226 m150526_220338_00127_c10080…1823177310081531_s1_p0/191/6351_12100
1733 6303 m150526_220338_00127_c1008…1823177310081531_s1_p0/191/1729_6304
6350 12097 m150526_220338_00127_c1008…1823177310081531_s1_p0/191/6351_12100
6434 11362 m150526_220338_00127_c1008…1823177310081531_s1_p0/191/18486_23763
12146 18443 m150526_220338_00127_c1008…1823177310081531_s1_p0/191/12145_18442
18488 23762 m150526_220338_00127_c1008…1823177310081531_s1_p0/191/18486_23763
fastq-dump id: ERR972361.196 read length: 8567
begin end dextractor_id
4380 5912 m150526_220338_00127_c1008…1823177310081531_s1_p0/195/4380_5915
5963 8376 m150526_220338_00127_c1008…1823177310081531_s1_p0/195/5961_8376
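The dextractor read names end in `/<hole>/<start>_<end>`, so the mapped coordinates can be compared directly with the coordinates encoded in the name. Below is a small Python sketch of that check. The helper names and the 10-base tolerance are illustrative choices, and `movie` is a stand-in for the movie name, which is truncated in the listing above.

```python
def parse_subread_name(name):
    """Split '<movie>/<hole>/<start>_<end>' into its four parts."""
    movie, hole, span = name.rsplit("/", 2)
    start, end = span.split("_")
    return movie, int(hole), int(start), int(end)


def matches_name_coordinates(map_begin, map_end, name, tolerance=10):
    """True if the mapped interval agrees with the interval in the read name."""
    _, _, start, end = parse_subread_name(name)
    return abs(map_begin - start) <= tolerance and abs(map_end - end) <= tolerance


# Rows taken from the ERR972361.54 and ERR972361.161 listings above;
# "movie" replaces the truncated movie name.
rows = [
    (3216, 7116, "movie/53/3210_7121"),
    (7171, 8309, "movie/53/7167_8310"),
    (3784, 11030, "movie/160/11102_18592"),
]
for begin, end, name in rows:
    print(name, matches_name_coordinates(begin, end, name))
```

The first two rows pass the check and the third does not: that subread also aligns to a region other than the one named in its ID, which would be consistent with several passes over the same insert ending up in a single fastq-dump read.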
So a question arises: does fastq-dump fail to extract PacBio subreads (or extract them badly) and output only raw reads?
I didn't find any information recommending against fastq-dump for PacBio data sets, but I may have missed something. Rob Edwards has already pointed out that fastq-dump is not well documented: https://edwards.sdsu.edu/research/fastq-dump/
In any case, I think it's safer not to use fastq-dump for extracting PacBio reads, and I would recommend using dextractor instead.
Versions of the tools used:
- minimap 0.2-r123
- miniasm 0.2-r128
- fastq-dump 2.8.2
- dextractor 1.0p2
Hi pmarijon,
Although I suspect it's not intended as such, this post looks a lot like a question, perhaps also because of the question in the title. Perhaps you could change that to make it clearer that you are reporting on your findings, rather than opening a question.
Cheers,
Wouter
Hi WouterDeCoster,
Thanks, I changed the title. I hope it's clearer now? (Now this is a question :) )
Pierre
I was thinking more of something like "Don't use fastq-dump for PacBio data", to be explicit and not look like a click-bait article, but okay :-)
"You won't believe what happened to this PhD student after he ran fastq-dump"
"You'd be surprised by the incredible story of this simple student when he uses the wrong tools to get reads"
Question being: how would one know that fastq-dump is the wrong tool in this case?
Hi,
Sorry I don't understand your question.
I think Bastien wants to know why we think the output of fastq-dump is "wrong". I recall it has to do with how the raw .bax.h5 PacBio files get converted to FASTQ. There seem to be additional step(s) needed in order to split raw reads and avoid chimeric reads. dextractor does them, but fastq-dump, in that case, doesn't.
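To illustrate what such a splitting step amounts to, here is a toy Python sketch that cuts a raw read at given subread coordinates. It is not dextractor's actual code: the coordinates are simply passed in rather than read from the .bax.h5 files, and the function name and the `min_length` default (chosen to match the 500 b shortest dextractor read reported in the post) are assumptions.

```python
def split_raw_read(name, sequence, regions, min_length=500):
    """Cut `sequence` at each (start, end) region.

    Pieces are named '<name>/<start>_<end>'; pieces shorter than
    `min_length` are dropped.
    """
    subreads = []
    for start, end in regions:
        if end - start >= min_length:
            subreads.append((f"{name}/{start}_{end}", sequence[start:end]))
    return subreads


raw = "A" * 9565  # stand-in for the 9565-base read ERR972361.54
# The first two regions come from the listing in the post; the third is
# made up to show the length filter.
pieces = split_raw_read("movie/53", raw, [(3210, 7121), (7167, 8310), (8400, 8450)])
for piece_name, piece_seq in pieces:
    print(piece_name, len(piece_seq))
```

Without this step, the whole raw read, adapters included, would reach the assembler as one sequence, which may be what produces the chimeric reads mentioned above.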