SRA Toolkit: using fastq-dump vs fasterq-dump - discrepancies in output?
3.6 years ago
Hamish ▴ 40

Hi

I am trying to get paired-end fastqs from a number of dbGaP-restricted SRA files and am unsure whether my output files are correct. The process I've followed is to use SRA Toolkit (version 10.8.3 running on Ubuntu) to prefetch the files, validate the download with vdb-validate, and then convert the .sra into fastq. I have used both fasterq-dump and fastq-dump for the conversion, and the output fastq files from each are of different sizes.

The steps I'm taking are as follows using SRR1293521 as an example:

  1. Download SRA file:

    ./prefetch --ngc prj_26006.ngc SRR1293521
    

    This succeeds with no errors

  2. Validate SRA download

    ./vdb-validate --ngc prj_26006.ngc SRR1293521/SRR1293521_dbGaP-26006.sra
    

    All validation tests are passed.

  3. Convert to fastq with fasterq-dump: I first make a copy of SRR1293521_dbGaP-26006.sra and name the copy SRR1293521, because fasterq-dump fails with the default file name.

    ./fasterq-dump --ngc prj_26006.ngc SRR1293521/SRR1293521
    

    Output:

    spots read : 99,531,818

    reads read : 199,063,636

    reads written : 120,934,449

    Resulting in 3 files:

    SRR1293521_1.fastq 5.3GB

    SRR1293521_2.fastq 5.3GB

    SRR1293521.fastq 19.4GB

  4. Convert to fastq with fastq-dump: I use --split-e here instead of --split-3 because of a typo in the current codebase, and I use --skip-technical because, according to this page, that should make this command functionally identical to the fasterq-dump command above.

    ./fastq-dump --split-e --skip-technical --ngc prj_26006.ngc SRR1293521/SRR1293521_dbGaP-26006.sra
    

    Output:

    Rejected 78129187 READS because READLEN < 1

    Read 99531818 spots for SRR1293521/SRR1293521_dbGaP-26006.sra

    Written 99531818 spots for SRR1293521/SRR1293521_dbGaP-26006.sra

    Resulting in 3 files:

    SRR1293521_dbGaP-26006_1.fastq 5.8GB

    SRR1293521_dbGaP-26006_2.fastq 5.8GB

    SRR1293521_dbGaP-26006.fastq 21.3GB
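As a sanity check, the counts in the two summaries above can be reconciled with plain shell arithmetic. This is only a sketch using the numbers copied from the tool output. The variable names are made up for illustration, and it assumes that the reads fastq-dump rejected for `READLEN < 1` are the same reads fasterq-dump did not write.

```shell
# Counts copied from the two tool summaries above.
spots=99531818            # spots read (reported by both tools)
reads_total=199063636     # "reads read" from fasterq-dump
reads_written=120934449   # "reads written" from fasterq-dump
rejected=78129187         # "Rejected ... READLEN < 1" from fastq-dump

# A paired-end run should have two reads per spot.
echo "reads per spot: $(( reads_total / spots ))"

# Reads remaining once the zero-length reads are dropped.
echo "non-empty reads: $(( reads_total - rejected ))"

# Gap between that figure and what fasterq-dump says it wrote.
echo "difference: $(( reads_total - rejected - reads_written ))"
```

A difference of 0 would suggest both tools kept the same set of non-empty reads, so the summaries agree even though they are worded differently.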

Is it expected to get different output from what I assumed were these functionally equivalent commands? If so, how do I know which fastq is the correct one? Usually I would download the raw fastq from ebi to cross-check but because it's a protected file this option isn't available. Also, would --split-files (resulting in 2 fastqs) be more suited than --split-e for this file?

Any suggestions would be much appreciated!

software error sratools fastq-dump fasterq-dump • 11k views
3.6 years ago
ATpoint 81k

What is the exact difference? I see three files in both. Size differences can be due to different header names. Start by confirming read numbers are the same in files between tools. Tbh, I'd just use the fast-dump one and get started with analysis.
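One way to confirm the read numbers is to count lines and divide by four. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines); `count_reads` is a hypothetical helper name, not part of SRA Toolkit.

```shell
# count_reads: hypothetical helper, not an SRA Toolkit command.
# Prints the number of records in a FASTQ file, assuming each
# record occupies exactly four lines.
count_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Example usage on the R1 files from each tool:
#   count_reads SRR1293521_1.fastq
#   count_reads SRR1293521_dbGaP-26006_1.fastq
```

If the two numbers match for each pair of corresponding files, the size difference is unlikely to come from missing reads.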


I've checked what you asked using R1 as an example: both files are 85610524 lines long, so they contain the same number of reads. The header information differs because it is based on the name of the input file. I've run head on both files, and with fasterq-dump the first read is:

    @SRR1293521.1 1 length=83
    AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
    +SRR1293521.1 1 length=83
    CCCFFFFFHHHHFIJIIIJIJJJJJIJJJIIGIIIIIIIIGHAHIIIJIGIJJJJJJIJJHFHGFFCCEADCDDBBBA??BB<

Whereas fastq-dump, with its longer input filename, gives this as the first read:

    @SRR1293521_dbGaP-26006.1 1 length=83
    AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
    +SRR1293521_dbGaP-26006.1 1 length=83
    CCCFFFFFHHHHFIJIIIJIJJJJJIJJJIIGIIIIIIIIGHAHIIIJIGIJJJJJJIJJHFHGFFCCEADCDDBBBA??BB<

So you were spot on - the only difference between these two is the length of the header (due to the different input file names) and that would account for why the file sizes are different. Thanks for your insight with this.
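Comparing the first read only shows that the files start the same way. To check the whole file while ignoring the read names, one option is to checksum just the sequence and quality lines. This is a rough sketch that assumes four-line FASTQ records and the same read order in both files; `payload_checksum` is a hypothetical helper name.

```shell
# payload_checksum: hypothetical helper, not an SRA Toolkit command.
# Keeps only line 2 (sequence) and line 4 (quality) of every
# four-line record, then checksums them, so differences in the
# header lines do not affect the result.
payload_checksum() {
    awk 'NR % 4 == 2 || NR % 4 == 0' "$1" | cksum
}

# Example usage:
#   payload_checksum SRR1293521_1.fastq
#   payload_checksum SRR1293521_dbGaP-26006_1.fastq
```

Matching checksums would indicate that the two tools produced the same sequences and qualities, with only the headers differing.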

