Question: SRA Toolkit: using fastq-dump vs fasterq-dump - discrepancies in output?
Hamish20 wrote:

Hi

I am trying to get paired-end fastqs from a number of dbGaP-restricted SRA files and am unsure whether my output files are correct. The process I've followed is to use SRA Toolkit (version 2.10.8 running on Ubuntu) to prefetch the files, validate the download with vdb-validate, and then convert the .sra into fastq. I have used both fasterq-dump and fastq-dump for the conversion, and the output fastq files from the two tools are of different sizes.

The steps I'm taking are as follows using SRR1293521 as an example:

  1. Download SRA file:

    ./prefetch --ngc prj_26006.ngc SRR1293521
    

    This succeeds with no errors

  2. Validate SRA download

    ./vdb-validate --ngc prj_26006.ngc SRR1293521/SRR1293521_dbGaP-26006.sra
    

    All validation tests are passed.

  3. Convert to fastq with fasterq-dump: I first make a copy of SRR1293521_dbGaP-26006.sra and rename the copy to SRR1293521, because fasterq-dump fails with the default name.

    ./fasterq-dump --ngc prj_26006.ngc SRR1293521/SRR1293521
    

    Output:

    spots read : 99,531,818

    reads read : 199,063,636

    reads written : 120,934,449

    Resulting in 3 files:

    SRR1293521_1.fastq 5.3GB

    SRR1293521_2.fastq 5.3GB

    SRR1293521.fastq 19.4GB

  4. Convert to fastq with fastq-dump: I use --split-e here instead of --split-3 because of a typo in the current codebase, and I use --skip-technical because, according to this page, that should make this command functionally identical to the above fasterq-dump command.

    ./fastq-dump --split-e --skip-technical --ngc prj_26006.ngc SRR1293521/SRR1293521_dbGaP-26006.sra
    

    Output:

    Rejected 78129187 READS because READLEN < 1

    Read 99531818 spots for SRR1293521/SRR1293521_dbGaP-26006.sra

    Written 99531818 spots for SRR1293521/SRR1293521_dbGaP-26006.sra

    Resulting in 3 files:

    SRR1293521_dbGaP-26006_1.fastq 5.8GB

    SRR1293521_dbGaP-26006_2.fastq 5.8GB

    SRR1293521_dbGaP-26006.fastq 21.3GB
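The two logs above can be cross-checked with simple arithmetic. The sketch below assumes that the reads fastq-dump reports as rejected (READLEN < 1) are the same empty reads that fasterq-dump reads but does not write; that interpretation is an assumption, not something either log states. The variable names are illustrative.

```shell
# Counts copied from the fasterq-dump and fastq-dump logs above.
spots=99531818          # spots read (both tools)
reads_read=199063636    # fasterq-dump: reads read
reads_written=120934449 # fasterq-dump: reads written
rejected=78129187       # fastq-dump: rejected because READLEN < 1

# Two reads per spot for paired-end data.
echo "expected reads from spots: $((2 * spots))"

# Reads fasterq-dump read but did not write.
dropped=$((reads_read - reads_written))
echo "reads not written by fasterq-dump: $dropped"
echo "reads rejected by fastq-dump:      $rejected"

if [ "$dropped" -eq "$rejected" ]; then
    echo "both tools discarded the same number of reads"
fi
```

If the two numbers match, both tools skipped the same number of reads, which suggests the size difference comes from something other than read content.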

Is it expected to get different output from what I assumed were functionally equivalent commands? If so, how do I know which fastq is the correct one? Usually I would download the raw fastq from EBI to cross-check, but because it's a protected file this option isn't available. Also, would --split-files (resulting in 2 fastqs) be better suited than --split-e for this file?

Any suggestions would be much appreciated!

ATpoint40k wrote:

What is the exact difference? I see three files in both. Size differences can be due to different header names. Start by confirming that the read numbers are the same in the corresponding files from each tool. Tbh, I'd just use the fasterq-dump one and get started with analysis.
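One way to confirm the read numbers is to count FASTQ records in each file. This is a minimal sketch that assumes every record spans exactly four lines (no wrapped sequences); `count_reads` is a hypothetical helper name, and the file names are the ones listed in the question.

```shell
# count_reads (hypothetical helper): print the number of FASTQ records
# in a file, assuming exactly 4 lines per record.
count_reads() {
    awk 'END { printf "%d\n", NR / 4 }' "$1"
}

# Compare the R1 files produced by the two tools, if they are present.
for f in SRR1293521_1.fastq SRR1293521_dbGaP-26006_1.fastq; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_reads "$f")"
    fi
done
```

Equal counts for matching files (_1 vs _1, _2 vs _2, unpaired vs unpaired) mean the tools wrote the same number of reads.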


I've checked what you asked, using R1 as an example: both files are 85610524 lines long, so they contain the same number of reads. The header information differs because it is based on the name of the input file. I've run head on both files. With fasterq-dump, the first read is:

    @SRR1293521.1 1 length=83
    AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
    +SRR1293521.1 1 length=83
    CCCFFFFFHHHHFIJIIIJIJJJJJIJJJIIGIIIIIIIIGHAHIIIJIGIJJJJJJIJJHFHGFFCCEADCDDBBBA??BB<

Whereas with fastq-dump, which was given the longer input filename, the first read is:

    @SRR1293521_dbGaP-26006.1 1 length=83
    AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
    +SRR1293521_dbGaP-26006.1 1 length=83
    CCCFFFFFHHHHFIJIIIJIJJJJJIJJJIIGIIIIIIIIGHAHIIIJIGIJJJJJJIJJHFHGFFCCEADCDDBBBA??BB<

So you were spot on: the only difference between the two is the length of the header (due to the different input file names), which accounts for the difference in file sizes. Thanks for your insight on this.
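To check more than the first read, the sequence and quality lines of the two files can be compared while ignoring the header lines. This is a rough sketch, assuming four-line records and that both tools write reads in the same order; `strip_headers` and `same_payload` are hypothetical helper names.

```shell
# strip_headers (hypothetical helper): keep only the sequence (line 2)
# and quality (line 4) of every 4-line FASTQ record.
strip_headers() {
    awk 'NR % 4 == 2 || NR % 4 == 0' "$1"
}

# same_payload (hypothetical helper): succeed when two FASTQ files carry
# identical sequence and quality lines, regardless of read names.
same_payload() {
    [ "$(strip_headers "$1" | cksum)" = "$(strip_headers "$2" | cksum)" ]
}

a=SRR1293521_1.fastq
b=SRR1293521_dbGaP-26006_1.fastq
if [ -f "$a" ] && [ -f "$b" ]; then
    if same_payload "$a" "$b"; then
        echo "sequence and quality lines are identical"
    else
        echo "files differ beyond the headers"
    fi
fi
```

A checksum match is not a byte-for-byte proof, but a mismatch is a reliable sign that the files differ in more than their read names.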

Reply written 6 weeks ago by Hamish20