SRA Toolkit: using fastq-dump vs fasterq-dump - discrepancies in output?
10 months ago
Hamish ▴ 40

Hi

I am trying to get paired-end fastqs from a number of dbGaP-restricted SRA files and am unsure whether my output files are correct. The process I've followed is to use SRA Toolkit (version 10.8.3 running on Ubuntu) to prefetch the files, validate the download with vdb-validate, and then convert the .sra file into fastq. I have used both fasterq-dump and fastq-dump for the conversion, and the output fastq files from the two tools are of different sizes.

The steps I'm taking are as follows using SRR1293521 as an example:

./prefetch --ngc prj_26006.ngc SRR1293521


This succeeds with no errors

./vdb-validate --ngc prj_26006.ngc SRR1293521/SRR1293521_dbGaP-26006.sra


All validation tests are passed.

3. Convert to fastq with fasterq-dump: I first make a copy of SRR1293521_dbGaP-26006.sra and rename the file SRR1293521 because it fails with the default name.

./fasterq-dump --ngc prj_26006.ngc SRR1293521/SRR1293521


Output:

Resulting in 3 files:

SRR1293521_1.fastq 5.3GB

SRR1293521_2.fastq 5.3GB

SRR1293521.fastq 19.4GB

4. Convert to fastq with fastq-dump: I use --split-e here instead of --split-3 because of a typo in the current codebase, and I use --skip-technical because, according to this page, that should make this command functionally identical to the above fasterq-dump command.

./fastq-dump --split-e --skip-technical --ngc prj_26006.ngc SRR1293521/SRR1293521_dbGaP-26006.sra


Output:

Written 99531818 spots for SRR1293521/SRR1293521_dbGaP-26006.sra

Resulting in 3 files:

SRR1293521_dbGaP-26006_1.fastq 5.8GB

SRR1293521_dbGaP-26006_2.fastq 5.8GB

SRR1293521_dbGaP-26006.fastq 21.3GB

Is it expected to get different output from what I assumed were functionally equivalent commands? If so, how do I know which fastq is the correct one? Usually I would download the raw fastq from EBI to cross-check, but because it's a protected file this option isn't available. Also, would --split-files (resulting in 2 fastqs) be better suited than --split-e for this file?

Any suggestions would be much appreciated!

software error sratools fastq-dump fasterq-dump
10 months ago
ATpoint 52k

What is the exact difference? I see three files in both. Size differences can be due to different header names. Start by confirming read numbers are the same in files between tools. Tbh, I'd just use the fast-dump one and get started with analysis.
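The read-count check suggested above could be done roughly as follows. This is a minimal sketch, assuming uncompressed FASTQ with the standard four lines per record (no wrapped sequence lines); `count_fastq_reads` is a hypothetical helper name, not part of SRA Toolkit.

```shell
#!/bin/sh
# Count reads in an uncompressed FASTQ file.
# Assumption: every record is exactly four lines
# (header, sequence, '+' line, qualities).
# count_fastq_reads is a hypothetical helper name.
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Usage with the file names from this thread:
#   count_fastq_reads SRR1293521_1.fastq
#   count_fastq_reads SRR1293521_dbGaP-26006_1.fastq
```

If both tools report the same number for the matching files, the size difference is more likely to come from the headers than from missing reads.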


I've checked what you asked using R1 as an example. Both files are 85610524 lines long, so they contain the same number of reads. The header information differs because it is derived from the name of the input file. I've run head on both files; with fasterq-dump, the first read is:

@SRR1293521.1 1 length=83
AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
+SRR1293521.1 1 length=83
CCCFFFFFHHHHFIJIIIJIJJJJJIJJJIIGIIIIIIIIGHAHIIIJIGIJJJJJJIJJHFHGFFCCEADCDDBBBA??BB<

Whereas fastq-dump, with its longer input filename, gives a first read of:

@SRR1293521_dbGaP-26006.1 1 length=83
AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
+SRR1293521_dbGaP-26006.1 1 length=83
CCCFFFFFHHHHFIJIIIJIJJJJJIJJJIIGIIIIIIIIGHAHIIIJIGIJJJJJJIJJHFHGFFCCEADCDDBBBA??BB<

So you were spot on - the only difference between these two is the length of the header (due to the different input file names) and that would account for why the file sizes are different. Thanks for your insight with this.
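To go one step beyond comparing line counts and the first record, the two outputs could be compared with the headers ignored, and the expected size difference estimated from the header length. The sketch below assumes four-line FASTQ records in which the read name is repeated on the '+' line, as in the records shown above; `fastq_payload_cksum` and `header_overhead_bytes` are hypothetical helper names.

```shell
#!/bin/sh
# Checksum only the sequence and quality lines (lines 2 and 4 of every
# four-line record), so that differing read headers do not affect the result.
# fastq_payload_cksum is a hypothetical helper name.
fastq_payload_cksum() {
    awk 'NR % 4 == 2 || NR % 4 == 0' "$1" | cksum
}

# Estimate the extra bytes caused by longer read names.
# Assumption: each record carries the name twice ('@' line and '+' line).
# Arguments: total line count of the file, extra characters per name.
header_overhead_bytes() {
    echo $(( $1 / 4 * 2 * $2 ))
}

# Usage with the file names from this thread:
#   fastq_payload_cksum SRR1293521_1.fastq
#   fastq_payload_cksum SRR1293521_dbGaP-26006_1.fastq
#   header_overhead_bytes 85610524 12
```

Matching checksums would indicate that the reads and qualities are identical and in the same order. For the size estimate, "_dbGaP-26006" adds 12 characters per name, so `header_overhead_bytes 85610524 12` gives 513663144 bytes, roughly 0.5 GB, which is in line with the 5.3GB versus 5.8GB files reported above.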