Question

Retrieving paired-end sequencing data with fastq-dump

0

23 months ago

v.berriosfarias ▴ 140

Hello, I need to retrieve 42 FASTQ files from NCBI SRA: https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=13674977

I retrieved the SRA accession numbers for the sequencing data and saved them to a file called "SraAccList.txt". The paper's methods section mentions that they worked with paired-end sequencing data, so I did the following to retrieve the FASTQ files:

# Download each accession listed in SraAccList.txt (one per line)
while read -r accs
do
    prefetch "$accs"
done < SraAccList.txt

Then, for the retrieved .sra files, I used fastq-dump to get the paired-end reads:

# --split-3 writes mates to _1.fastq/_2.fastq and unpaired reads to .fastq
for f in *.sra
do
    fastq-dump --split-3 "$f"
done

But I only got SRR{numbers}.fastq files, not paired-end read files.
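
A quick way to see which of the three --split-3 outcomes each accession produced is to check the file names. This is a minimal sketch: the function name `classify_split3_output` is made up for illustration, and it assumes the FASTQ files were written uncompressed with fastq-dump's default naming.

```shell
# Classify the fastq-dump --split-3 output for one accession.
# Usage: classify_split3_output ACCESSION [DIR]
# (hypothetical helper; assumes default, uncompressed output names)
classify_split3_output() {
    acc=$1
    dir=${2:-.}
    if [ -f "$dir/${acc}_1.fastq" ] && [ -f "$dir/${acc}_2.fastq" ]; then
        echo "$acc paired"      # both mate files were written
    elif [ -f "$dir/$acc.fastq" ]; then
        echo "$acc single"      # only the unpaired-read file exists
    else
        echo "$acc missing"     # nothing was written for this accession
    fi
}

# Example: loop over the accession list from the question
# while read -r acc; do classify_split3_output "$acc"; done < SraAccList.txt
```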

Other similar threads discuss cases where the submitters don't provide the full FASTQ data, but I'm not sure whether that is my case, whether the retrieved FASTQ files are in interleaved format, or whether they are just single-end data.
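
The interleaved possibility can be checked roughly from the read names. The sketch below assumes plain 4-line FASTQ records and treats a file as interleaved only when every consecutive pair of records shares a name after stripping a trailing /1 or /2; the function name `looks_interleaved` is made up for illustration.

```shell
# Report whether consecutive FASTQ records share a read name,
# which is what an interleaved file would look like.
# Assumes 4-line records; a trailing /1 or /2 on the name is ignored.
looks_interleaved() {
    awk 'NR % 4 == 1 {
             name = $1                  # first field of the header line
             sub(/\/[12]$/, "", name)   # drop a trailing /1 or /2
             n++
             if (n % 2 == 0 && name == prev) pairs++
             prev = name
         }
         END {
             if (n > 0 && n % 2 == 0 && pairs == n / 2) print "interleaved"
             else print "not interleaved"
         }' "$1"
}

# Example on the first 1000 records of one file:
# head -n 4000 SRR10912829.fastq > sample.fastq && looks_interleaved sample.fastq
```

Headers such as @SRR10912829.1 followed by @SRR10912829.2 would be reported as not interleaved, since the names differ between consecutive records.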

I looked at the run information pages of the 42 SRR accessions, and they are labeled as PAIRED sequencing data: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10912829

but it definitely seems that they only provided single-end data: SRR info

I compared these submitted runs to a well-published one, where the submitters provided both mates, as shown by the green bars:

SRR info2

So it seems the submitters didn't provide the complete sequencing data. Is this correct?

NCBI fastq sra-toolkit • 984 views

ADD COMMENT • link updated 23 months ago by Istvan Albert 100k • written 23 months ago by v.berriosfarias ▴ 140

1

23 months ago

Istvan Albert 100k

I suspect they have mislabeled their data as paired.

 bio search SRR10912829

prints:

[
    {
        "run_accession": "SRR10912829",
        "sample_accession": "SAMN13674977",
        "first_public": "2020-01-20",
        "country": "Antarctica",
        "sample_alias": "Antarctic Polar",
        "fastq_bytes": "4133348227",
        "read_count": "52569045",
        "library_name": "S08-1",
        "library_strategy": "WGS",
        "library_source": "METAGENOMIC",
        "library_layout": "PAIRED",
        "instrument_platform": "ILLUMINA",
        "instrument_model": "Illumina HiSeq 2500",
        "study_title": "The polar microbiota Metagenome",
        "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/029/SRR10912829/SRR10912829.fastq.gz"
    }
]

Note how only a single FASTQ file is listed under fastq_ftp, even though library_layout says PAIRED.
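
The same comparison can be scripted for all 42 runs. This is only a sketch: the function names are made up, and it assumes the ENA convention of separating multiple entries in fastq_ftp with semicolons, so a properly submitted paired run would list at least two files.

```shell
# Count the FASTQ files listed in an ENA-style fastq_ftp value.
# Assumes multiple files are separated by semicolons.
count_fastq_files() {
    if [ -z "$1" ]; then
        echo 0
    else
        printf '%s\n' "$1" | tr ';' '\n' | grep -c .
    fi
}

# Compare the declared layout with the number of files listed.
# Usage: check_layout LIBRARY_LAYOUT FASTQ_FTP
check_layout() {
    n=$(count_fastq_files "$2")
    if [ "$1" = "PAIRED" ] && [ "$n" -lt 2 ]; then
        echo "mismatch: labeled PAIRED but $n FASTQ file(s) listed"
    else
        echo "ok: $1 with $n FASTQ file(s)"
    fi
}

# Values taken from the record above
check_layout "PAIRED" "ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/029/SRR10912829/SRR10912829.fastq.gz"
```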

0

I didn't know about bio search. Really useful.

The submitters uploaded each file separately, one per SRR accession. Thank you!

ADD REPLY • link 23 months ago by v.berriosfarias ▴ 140

score 2 · Accepted Answer · 2022-05-01

2

23 months ago

GenoMax 141k

If this is indeed paired-end data, as described in the paper, then it is unfortunate that the submitters appear to have submitted the two files of each pair as separate runs (LINK for Run Browser). You can confirm that by comparing the library names (as marked below) and checking whether you can relate them to the publication.

screenshot

0

I hadn't seen the Run Browser! They definitely uploaded each file separately, one per SRR accession. Thank you so much.