Question

SRA Toolkit: using fastq-dump vs fasterq-dump - discrepancies in output?

3

4.8 years ago

Hamish ▴ 40

Hi

I am trying to get paired-end fastqs from a number of dbGaP-restricted SRA files and am unsure whether my output files are correct. The process I've followed is to use SRA Toolkit (version 10.8.3 running on Ubuntu) to prefetch the files, validate the download with vdb-validate, and then convert the .sra into fastq. I have used both fasterq-dump and fastq-dump for the conversion, and the output fastq files from each are different sizes.

The steps I'm taking are as follows using SRR1293521 as an example:

Download SRA file:
```
./prefetch --ngc prj_26006.ngc SRR1293521
```
This succeeds with no errors

Validate SRA download

./vdb-validate --ngc prj_26006.ngc SRR1293521/SRR1293521_dbGaP-26006.sra

All validation tests are passed.

Convert to fastq with fasterq-dump: I first make a copy of SRR1293521_dbGaP-26006.sra and rename the file SRR1293521 because it fails with the default name.
```
./fasterq-dump --ngc prj_26006.ngc SRR1293521/SRR1293521
```
Output:

spots read : 99,531,818

reads read : 199,063,636

reads written : 120,934,449

Resulting in 3 files:

SRR1293521_1.fastq 5.3GB

SRR1293521_2.fastq 5.3GB

SRR1293521.fastq 19.4GB

Convert to fastq with fastq-dump: I use --split-e here instead of --split-3 because of a typo in the current codebase, and I use --skip-technical because, according to this page, that should make this command functionally identical to the above fasterq-dump command.
```
./fastq-dump --split-e --skip-technical --ngc prj_26006.ngc SRR1293521/SRR1293521_dbGaP-26006.sra
```
Output:

Rejected 78129187 READS because READLEN < 1

Read 99531818 spots for SRR1293521/SRR1293521_dbGaP-26006.sra

Written 99531818 spots for SRR1293521/SRR1293521_dbGaP-26006.sra

Resulting in 3 files:

SRR1293521_dbGaP-26006_1.fastq 5.8GB

SRR1293521_dbGaP-26006_2.fastq 5.8GB

SRR1293521_dbGaP-26006.fastq 21.3GB
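
One way to compare the two logs is to check whether the reads fasterq-dump did not write match the reads fastq-dump rejected. A rough sketch using only the numbers printed above (the variable names are illustrative):

```shell
# Numbers copied from the two logs above.
spots=99531818            # "spots read" (fasterq-dump) / "Read ... spots" (fastq-dump)
reads_read=199063636      # fasterq-dump "reads read"
reads_written=120934449   # fasterq-dump "reads written"
rejected=78129187         # fastq-dump "Rejected ... READS because READLEN < 1"

# A paired-end spot holds two reads, so this should equal reads_read.
echo "reads expected from spots: $(( spots * 2 ))"

# Reads that fasterq-dump read but did not write; compare with $rejected.
echo "reads not written by fasterq-dump: $(( reads_read - reads_written ))"
echo "reads rejected by fastq-dump:      $rejected"
```

Two reads per spot gives 199,063,636, and the 78,129,187 reads that fasterq-dump did not write equal the number fastq-dump rejected for READLEN < 1. That suggests, without proving it, that both tools dropped the same zero-length reads.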

Is it expected to get different output from what I assumed were these functionally equivalent commands? If so, how do I know which fastq is the correct one? Usually I would download the raw fastq from ebi to cross-check but because it's a protected file this option isn't available. Also, would --split-files (resulting in 2 fastqs) be more suited than --split-e for this file?

Any suggestions would be much appreciated!

software error sratools fastq-dump fasterq-dump • 14k views

score 2 · Answer 1 · 2020-09-11

4.8 years ago

ATpoint 88k

What is the exact difference? I see three files in both. Size differences can be due to different header names. Start by confirming read numbers are the same in files between tools. Tbh, I'd just use the fast-dump one and get started with analysis.
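
A minimal sketch of the read-count check, assuming standard four-line FASTQ records with no wrapped sequence or quality lines. The function name `count_reads` is made up for illustration and is not part of SRA Toolkit:

```shell
# Count FASTQ records in a file by dividing the line count by 4.
# Assumes every record is exactly 4 lines (header, sequence, +, quality).
count_reads() {
    awk 'END { printf "%d\n", NR / 4 }' "$1"
}

# Example calls, using the file names from the question:
# count_reads SRR1293521_1.fastq
# count_reads SRR1293521_dbGaP-26006_1.fastq
```

If the two counts match for each pair of corresponding files, any remaining size difference has to come from the content of the records rather than their number.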

I've checked what you asked using R1 as an example: both files are 85610524 lines long, so they will contain the same number of reads. The header information differs based on the name of the input file. I've run head on both files; with fasterq-dump, the first read is:

@SRR1293521.1 1 length=83

AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC

+SRR1293521.1 1 length=83

CCCFFFFFHHHHFIJIIIJIJJJJJIJJJIIGIIIIIIIIGHAHIIIJIGIJJJJJJIJJHFHGFFCCEADCDDBBBA??BB<

Whereas fastq-dump, with its longer input filename, gives a first read of:

@SRR1293521_dbGaP-26006.1 1 length=83

AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC

+SRR1293521_dbGaP-26006.1 1 length=83

CCCFFFFFHHHHFIJIIIJIJJJJJIJJJIIGIIIIIIIIGHAHIIIJIGIJJJJJJIJJHFHGFFCCEADCDDBBBA??BB<
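
Looking at the first record only covers one read. To compare the whole files while ignoring the headers, one option is to checksum just the sequence and quality lines. This is a sketch assuming four-line records; `payload_cksum` is a hypothetical helper name:

```shell
# Checksum only lines 2 and 4 of each record (sequence and quality),
# so differences in the @ and + header lines are ignored.
payload_cksum() {
    awk 'NR % 4 == 2 || NR % 4 == 0' "$1" | cksum
}

# Example calls, using the file names from the question:
# payload_cksum SRR1293521_1.fastq
# payload_cksum SRR1293521_dbGaP-26006_1.fastq
```

Identical output from both calls would indicate the same sequences and qualities in the same order. cksum is a simple CRC, so this is a sanity check rather than a strong guarantee.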

So you were spot on - the only difference between these two is the length of the header (due to the different input file names) and that would account for why the file sizes are different. Thanks for your insight with this.
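
As a rough cross-check, the extra header text can be turned into an expected size difference. The sketch below uses the line count and the two name prefixes shown above, and assumes the prefix appears on both the @ and + lines of every record, as in the example reads:

```shell
short="SRR1293521"              # prefix in the fasterq-dump headers
long="SRR1293521_dbGaP-26006"   # prefix in the fastq-dump headers
lines=85610524                  # line count reported for each R1 file

reads=$(( lines / 4 ))                              # 4 lines per FASTQ record
extra_per_read=$(( (${#long} - ${#short}) * 2 ))    # prefix appears on 2 lines
extra_bytes=$(( reads * extra_per_read ))

echo "reads: $reads"
echo "extra bytes per read: $extra_per_read"
echo "extra bytes in file: $extra_bytes"
```

This works out to 21,402,631 reads with 24 extra bytes each, about 0.51 GB in total, which is in line with the 5.3GB versus 5.8GB reported for the R1 files.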

ADD REPLY • link 4.8 years ago by Hamish ▴ 40