Question

fasterq-dump downloads two or three files from a single-end run


2.1 years ago

FGV ▴ 170

I've been trying to use fasterq-dump to download some single-end runs, but I get two and sometimes three files.

For example, run SRR12920588:

fasterq-dump --progress --skip-technical --force SRR12920588

gives:

SRR12920588_1.fastq
SRR12920588_2.fastq

and run SRR17115876:

fasterq-dump --progress --skip-technical --force SRR17115876

gives:

SRR17115876_1.fastq
SRR17115876_2.fastq
SRR17115876.fastq

Shouldn't I be getting just one file for each? Any idea why?

fasterq-dump • 543 views

updated 2.1 years ago by Istvan Albert 100k • written 2.1 years ago by FGV ▴ 170


I think it is paired-end data. For SRR12920588 the layout is listed as paired-end. For SRR17115876 the layout is listed as single, but in the run browser the reads are shown as "Reads (joined)" with an option to view separate reads, which looks like paired-end to me.

2.1 years ago by vk ▴ 40

score 1 · Answer 1 · 2022-03-25

The easiest way to quickly check the run metadata is:

bio search SRR12920588

it tells us that:

[
    {
        "run_accession": "SRR12920588",
        "sample_accession": "SAMN16578971",
        "first_public": "2021-01-04",
        "country": "",
        "sample_alias": "ChIP polE TC H3K56Q_2 5",
        "fastq_bytes": "7406752;6470603",
        "read_count": "206418",
        "library_name": "ChIP polE TC H3K56Q_2 5",
        "library_strategy": "ChIP-Seq",
        "library_source": "GENOMIC",
        "library_layout": "PAIRED",
        "instrument_platform": "ILLUMINA",
        "instrument_model": "Illumina NovaSeq 6000",
        "study_title": "Rtt109 effect on replication speed via histone acetylation",
        "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/088/SRR12920588/SRR12920588_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/088/SRR12920588/SRR12920588_2.fastq.gz"
    }
]

Now we know that the run is paired-end, which means we get at least two files and possibly more; sometimes the sample index reads are included as well.
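When checking many runs, it can help to pull just the layout field out of that JSON. The following is a minimal sketch, assuming the output has the shape shown above; `get_layout` is a hypothetical helper name, not part of the bio package, and it relies only on grep and sed.

```shell
# Hypothetical helper: read `bio search` JSON on stdin and print the
# value of every "library_layout" field, one per line.
# Assumes the field is formatted as in the output shown above.
get_layout() {
    grep -o '"library_layout": *"[^"]*"' | sed 's/.*: *"\(.*\)"/\1/'
}

# Usage (needs network access and the bio package installed):
#   bio search SRR12920588 | get_layout
```

For the record shown above this should print PAIRED. A proper JSON parser would be more robust than pattern matching if the output format ever changes.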

Now let the bioinformatics begin:

fastq-dump -X 10 SRR12920588

seqkit stats SRR12920588.fastq

will print:

file               format  type  num_seqs  sum_len  min_len  avg_len  max_len
SRR12920588.fastq  FASTQ   DNA         10    1,020      102      102      102

It looks like we got a single file called SRR12920588.fastq with 10 records, where each sequence is 102 bp long. The paired reads are concatenated into a single long sequence.

If we pass the --split-spot flag to the same command:

fastq-dump -X 10 --split-spot SRR12920588
seqkit stats SRR12920588.fastq

the results will be:

file               format  type  num_seqs  sum_len  min_len  avg_len  max_len
SRR12920588.fastq  FASTQ   DNA         20    1,020       51       51       51

Now we get a single file called SRR12920588.fastq, but this time it has 20 records, each 51 bp long. The paired reads follow one another (the so-called interleaved format).
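To make the relationship between the concatenated and the interleaved output concrete, here is a toy sketch that cuts every record of a FASTQ file in half. It is only an illustration: it assumes both mates have the same length and four lines per record, whereas fastq-dump uses the read boundaries stored in the archive. `split_in_half` is a hypothetical helper name.

```shell
# Toy illustration only: split each concatenated record into two
# equal halves and print them one after the other (interleaved).
# Assumes four lines per record and mates of equal length.
split_in_half() {
    awk '
        NR % 4 == 1 { header = $0 }   # record header line
        NR % 4 == 2 { seq = $0 }      # sequence line
        NR % 4 == 0 {                 # quality line: emit both halves
            half = int(length(seq) / 2)
            print header "\n" substr(seq, 1, half) "\n+\n" substr($0, 1, half)
            print header "\n" substr(seq, half + 1) "\n+\n" substr($0, half + 1)
        }
    ' "$1"
}

# Usage:
#   split_in_half SRR12920588.fastq > interleaved.fastq
```

Applied to the 10 concatenated 102 bp records, this would yield 20 records of 51 bp each, the same shape that seqkit reports for the --split-spot output.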

But then, if we pass the --split-files flag:

fastq-dump -X 10 --split-files SRR12920588
seqkit stats SRR12920588_*

the command will now produce two files, each with 10 reads that are 51 bp long.

file                 format  type  num_seqs  sum_len  min_len  avg_len  max_len
SRR12920588_1.fastq  FASTQ   DNA         10      510       51       51       51
SRR12920588_2.fastq  FASTQ   DNA         10      510       51       51       51

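You can also pass the --split-3 flag, which writes reads that have no mate to a third, unsuffixed file, so it might produce more files, but not in this case. fasterq-dump uses this mode by default, which would explain a third file such as SRR17115876.fastq.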

As a rule, run read statistics to understand what is inside your file.
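If seqkit is not at hand, a rough equivalent can be put together with awk. This is a minimal sketch, assuming the common four-lines-per-record FASTQ layout with unwrapped sequences; `fastq_lengths` is a hypothetical helper name.

```shell
# Hypothetical helper: print the read count plus the minimum and
# maximum read length of a FASTQ file.
# Assumes four lines per record (no wrapped sequence lines).
fastq_lengths() {
    awk 'NR % 4 == 2 {
             n++
             len = length($0)
             if (n == 1 || len < min) min = len
             if (len > max) max = len
         }
         END { printf "%d reads, min %d, max %d\n", n, min, max }' "$1"
}

# Usage:
#   fastq_lengths SRR12920588_1.fastq
```

Comparing the read counts of the _1 and _2 files this way is also a quick check that the two mate files are in step with each other.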