Question

Issues in downloading read sets — sra fastq-dump

0

2.2 years ago

Matteo Ungaro ▴ 100

Hi all,

I'm having trouble understanding why I cannot download some read sets from a batch I launch with a script that invokes fastq-dump from the SRA Toolkit. It seems unable to connect to NCBI, even though the script itself looks correct. Does anyone have an idea? The script is quoted below:

> #!/bin/bash
> #
> #SBATCH --nodes=1 --ntasks=2 --cpus-per-task=24
> #SBATCH --time=24:00:00
> #SBATCH --mem=350gb
> #
> #SBATCH --job-name=SAS-EUR_populations
> #SBATCH --output=SAS-EUR_individuals.out
> #
> #SBATCH --partition=g100_usr_smem
> #SBATCH --account=IscrC_PanSV
> 
> module load profile/bioinf sra/2.9.6
> 
> cd /g100_work/IscrC_PanSV/NA20847
> 
> fastq-dump --gzip SRR13606073 
> fastq-dump --gzip SRR13606074
> 
> cd /g100_work/IscrC_PanSV/NA20509
> 
> fastq-dump --gzip SRR13606071
> fastq-dump --gzip SRR13606072

Sorry about the formatting: in the actual script, each fastq-dump command and each #SBATCH directive is on its own line. The interface renders the quote in this strange format.

Thanks in advance,

Matteo

script cluster sra fastq-dump • 1.0k views

ADD COMMENT • link updated 2.2 years ago by ATpoint 82k • written 2.2 years ago by Matteo Ungaro ▴ 100

0

Save yourself the interaction with that terrible tool and use https://sra-explorer.info/ to get fastq download links directly. In my experience fastq-dump is rather unstable and experiences connection losses rather frequently.

ADD REPLY • link 2.2 years ago by ATpoint 82k

0

Thanks a lot, I'll have a look at that! The problem, though, is that I cannot use wget, because the thin and thick nodes of that cluster do not have internet access... So I'm somewhat forced to use the SRA Toolkit.

ADD REPLY • link 2.2 years ago by Matteo Ungaro ▴ 100

0

Sorry, I do not understand. If you have no internet, then how can you access NCBI via the toolkit?

ADD REPLY • link 2.2 years ago by ATpoint 82k

0

Somehow I can use the toolkit on the thin and thick nodes, but not wget, which stopped abruptly without downloading anything... So I resorted to the SRA Toolkit, which, as you said, is quite unstable.

I experimented with wget on the login nodes, which according to user support are the only ones connected to the web; the problem is that those nodes have a wall-time limit of 4 h, and some of the files take longer than that to download.
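One way around a wall-time limit like this is to make the download resumable, so that a transfer cut off after 4 h can be picked up again in the next session. The following is only a sketch under assumptions: `resume_downloads`, `run`, and the `DRY_RUN` switch are hypothetical helper names, the URL list is a plain text file with one link per line (for example, links copied from sra-explorer), and resuming only works if the remote server supports partial transfers.

```shell
#!/bin/bash
# Sketch: resumable wget downloads for nodes with a short wall-time limit.
set -euo pipefail

# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# resume_downloads URL_LIST OUTDIR (hypothetical helper name)
resume_downloads() {
    local url_list="$1" outdir="$2"
    # --continue resumes partially downloaded files instead of starting over;
    # --input-file reads one URL per line; --directory-prefix sets the target directory.
    run wget --continue --input-file "$url_list" --directory-prefix "$outdir"
}

# Only act when both arguments are given, e.g.: ./resume.sh urls.txt reads
if [ "$#" -eq 2 ]; then
    resume_downloads "$1" "$2"
fi
```

Re-running the same command in a later login session should continue from whatever was already written to the output directory.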

I think the answer might lie in what GenoMax said below: the sequences are very "new", and I might have to go through the .bam files to get the .fastq.

ADD REPLY • link 2.2 years ago by Matteo Ungaro ▴ 100

1

If the SRA Toolkit is really the only way to go, then try downloading the SRA file first with prefetch and then converting it to fastq locally with fastq-dump, rather than downloading with fastq-dump directly. sra-explorer also offers Aspera download links, which you can try; also see whether curl works better. Maybe it is just a temporary server issue with NCBI's FTP servers.
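A minimal sketch of the prefetch-then-convert approach, using the first accession from the question as an example. Several things here are assumptions rather than tested facts about this cluster: `fetch_and_convert`, `run`, and `DRY_RUN` are hypothetical helper names; `--output-directory` is assumed to be available in the installed prefetch (older releases such as 2.9.6 may instead store the file in the configured user repository); and the archive is assumed to land at `OUTDIR/ACCESSION/ACCESSION.sra`. Large runs may also exceed the default prefetch size limit, in which case `--max-size` would need to be raised.

```shell
#!/bin/bash
# Sketch: download the .sra archive first, then convert it locally.
set -euo pipefail

# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# fetch_and_convert ACCESSION OUTDIR (hypothetical helper name)
fetch_and_convert() {
    local acc="$1" outdir="$2"
    run mkdir -p "$outdir"
    # Step 1: network transfer only, so a failed download can be retried on its own.
    run prefetch --output-directory "$outdir" "$acc"
    # Step 2: local conversion; assumes prefetch wrote OUTDIR/ACC/ACC.sra.
    run fastq-dump --gzip --outdir "$outdir" "$outdir/$acc/$acc.sra"
}

# Accessions are passed as arguments; with no arguments nothing is done.
# Example: OUTDIR=/g100_work/IscrC_PanSV/NA20847 ./fetch.sh SRR13606073 SRR13606074
for acc in "$@"; do
    fetch_and_convert "$acc" "${OUTDIR:-.}"
done
```

Setting `DRY_RUN=1` prints the commands without running them, which makes it easy to check paths before submitting the job.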

ADD REPLY • link 2.2 years ago by ATpoint 82k

score 2 · Answer 1 · 2022-03-03

2

2.2 years ago

GenoMax 142k

That appears to be a brand new PacBio dataset that was just published late last month.

It may be best to visit the SRA run page and click on the "Data access" tab. The submitters have provided the original BAM files for this run. Download those (they are currently available from the cloud URLs posted there; they will eventually require a cloud account once they move to cold storage) and then convert them to fastq using the bam2fastx (LINK) utility provided by PacBio.
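The conversion step could look roughly like the sketch below. It assumes the `bam2fastq` and `pbindex` executables from PacBio's tools are on the PATH, that `bam2fastq` needs a `.pbi` index next to the BAM, and that it writes gzip-compressed output to `PREFIX.fastq.gz` by default; `bam_to_fastq`, `run`, and `DRY_RUN` are hypothetical helper names.

```shell
#!/bin/bash
# Sketch: convert downloaded PacBio BAM files to FASTQ with bam2fastq.
set -euo pipefail

# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# bam_to_fastq BAM PREFIX (hypothetical helper name)
bam_to_fastq() {
    local bam="$1" prefix="$2"
    # Assumption: bam2fastq expects BAM.pbi to exist; create it with pbindex if missing.
    if [ ! -e "$bam.pbi" ]; then
        run pbindex "$bam"
    fi
    # Assumption: output is written to PREFIX.fastq.gz.
    run bam2fastq -o "$prefix" "$bam"
}

# BAM files are passed as arguments; the output prefix is the file name without .bam.
for bam in "$@"; do
    bam_to_fastq "$bam" "${bam%.bam}"
done
```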

ADD COMMENT • link 2.2 years ago by GenoMax 142k

0

Thanks, I was actually thinking of doing so; I even looked at bedtools to convert .bam to .fastq. However, it seems more logical to use the utility provided specifically by PacBio.

ADD REPLY • link 2.2 years ago by Matteo Ungaro ▴ 100