Retrieving paired-end sequencing data with fastq-dump
2
0
9 weeks ago

Hello, I need to retrieve 42 fastq files from NCBI SRA: https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=13674977

I retrieved the SRA accession numbers for the sequencing data and saved them to a file called "SraAccList.txt". The paper's methods section says they worked with paired-end sequencing data, so I did the following to retrieve the fastq files:

    list=$(cat SraAccList.txt)
    for accs in $list
    do
        prefetch "$accs"
    done

Then, for the retrieved .sra files, I used fastq-dump to finally get the paired-end reads:

    for f in *.sra
    do
        fastq-dump --split-3 "$f"
    done


But I only got SRR{numbers}.fastq files, not paired-end read files.
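To see at a glance which accessions were split and which were not, something like the following could be run after the fastq-dump loop. It is only a sketch: `classify_run` is a made-up helper name, and it assumes the usual `--split-3` naming, where paired reads land in `<accession>_1.fastq` and `<accession>_2.fastq` and unpaired reads in `<accession>.fastq`.

```shell
# Hypothetical helper (not part of sra-toolkit): report how a run was written.
# $1 = run accession, $2 = directory holding the fastq files (default: .)
classify_run() {
    acc=$1
    dir=${2:-.}
    if [ -f "$dir/${acc}_1.fastq" ] && [ -f "$dir/${acc}_2.fastq" ]; then
        # Both mate files exist, so the run was split into pairs
        echo "$acc paired"
    elif [ -f "$dir/${acc}.fastq" ]; then
        # Only the unsuffixed file exists
        echo "$acc single"
    else
        echo "$acc missing"
    fi
}
```

It could be applied to the whole list with `while read -r acc; do classify_run "$acc"; done < SraAccList.txt`.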

Other similar threads discuss cases where the submitters don't provide the full fastq data, but I'm not sure whether that is my case, whether the retrieved fastq files are in interleaved format, or whether they are just single-end data.
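The interleaved possibility can be checked roughly by comparing the names of consecutive records. The sketch below makes some assumptions: `is_interleaved` is a hypothetical helper, records are exactly four lines each, and mates share a read name (optionally ending in `/1` and `/2`). It only looks at the first 1000 records, so it is a heuristic rather than proof.

```shell
# Hypothetical helper: guess whether a FASTQ file is interleaved.
# $1 = path to an uncompressed FASTQ file
is_interleaved() {
    head -n 4000 "$1" | awk '
        NR % 4 == 1 {                      # header line of each 4-line record
            name = $1
            sub(/\/[12]$/, "", name)       # drop an optional /1 or /2 suffix
            n++
            if (n % 2 == 1) {
                prev = name                # first record of a candidate pair
            } else {
                pairs++
                if (name == prev) same++   # second record has the same name
            }
        }
        END {
            if (pairs > 0 && same == pairs) print "interleaved"
            else print "not interleaved"
        }'
}
```

If every consecutive pair of records shares a name the file is probably interleaved; otherwise the reads are more likely single-end or one mate only.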

I took a look into the run information page of the 42 SRR accessions and they are labeled as PAIRED sequencing data: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10912829

But it definitely seems that they only provided single-end data:

I compared these submitted SRR runs to a well-published one, where the submitters provide both mates, as shown by the green bars:

So it seems the submitters don't provide the complete sequencing data, is this correct?

NCBI fastq sra-toolkit • 399 views
2
9 weeks ago
GenoMax 117k

If this is indeed paired-end data as described in the paper, then it is unfortunate that the submitters appear to have submitted the individual paired-end data files as separate runs. LINK for Run Browser. You can confirm that by comparing the library names (as marked below) and checking whether you can relate them to the publication.
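Comparing library names across 42 runs is easier with the runs sorted by library name, so that related libraries sit on adjacent lines. A minimal sketch, assuming you have already saved a two-column, tab-separated file of run accession and library name (the file layout and the helper name `runs_by_library` are assumptions, not something the Run Browser produces directly):

```shell
# Hypothetical helper: list runs ordered by library name, then by accession.
# $1 = tab-separated file with columns: run accession, library name
runs_by_library() {
    tab=$(printf '\t')
    sort -t "$tab" -k2,2 -k1,1 "$1"
}
```

How the two mates of a library are named (identical names, or names differing by a suffix) still has to be judged by eye against the publication.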

0

I hadn't seen the Run Browser! They definitely uploaded each file separately, one per SRR accession. Thank you so much.

0

Nice observation, GenoMax. In bioinformatics we have to expect the unexpected.

1
9 weeks ago

I suspect they have mislabeled their data as paired.

 bio search SRR10912829


prints:

[
{
"run_accession": "SRR10912829",
"sample_accession": "SAMN13674977",
"first_public": "2020-01-20",
"country": "Antarctica",
"sample_alias": "Antarctic Polar",
"fastq_bytes": "4133348227",
"library_name": "S08-1",
"library_strategy": "WGS",
"library_source": "METAGENOMIC",
"library_layout": "PAIRED",
"instrument_platform": "ILLUMINA",
"instrument_model": "Illumina HiSeq 2500",
"study_title": "The polar microbiota Metagenome",
"fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/029/SRR10912829/SRR10912829.fastq.gz"
}
]


Note how only a single FASTQ file is provided.
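To run the same check over many accessions, the number of files in the `fastq_ftp` field can be counted. This is a rough sketch: `count_fastq_files` is a hypothetical helper, it assumes one `"key": "value"` pair per line as in the output above, and it relies on multiple files being listed in that field separated by semicolons. It uses plain `sed` and `awk` rather than a real JSON parser.

```shell
# Hypothetical helper: count files listed in "fastq_ftp" for each record.
# Reads JSON on stdin, prints one number per record found.
count_fastq_files() {
    # Extract the value of the fastq_ftp field, then count ';'-separated parts
    sed -n 's/.*"fastq_ftp": *"\([^"]*\)".*/\1/p' |
        awk -F ';' '{ print NF }'
}
```

For example, `bio search SRR10912829 | count_fastq_files` should report 1 for the record shown, while a run submitted as a proper pair would be expected to report 2.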

0

Didn't know about bio search. Really useful.

The submitters uploaded each file separately, one per SRR accession. Thank you!