Question: Download SAM/BAM files from SRA takes ages!!!
1
23 months ago by
Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK

Dear all,

I have tried

 sam-dump SRR3330607 | samtools view -bS - > 42RF.bam

and

 sam-dump SRR3330607 > SRR3330607.sam

Both are taking hours to download. Is there a more efficient way to download these files from SRA?

I want to use the bam files, and I have many other samples to download.

Thanks for your help. Alejandro

bam sra • 2.8k views
ADD COMMENTlink modified 7 months ago by shengwei30 • written 23 months ago by Alejandro Jimenez Sanchez120

Are you able to use FASTQ files (which you can then align yourself)? If so, get them from EBI-ENA (example).

ADD REPLYlink written 23 months ago by genomax68k

It's quite likely that what you're getting is an unaligned BAM file, which is largely useless.

ADD REPLYlink written 23 months ago by Devon Ryan90k

You can easily check for alignment information in the SRA Run Browser.

I had a conversation with the SRA team once where they explained to me that they really optimized SRA for generating FASTQ and running BLAST queries and not for generating SAM. I’ve noticed that SAM dump is usually slower than I’d like, but if you're truly getting alignment info I’m sure you’re saving time over aligning the FASTQ.

ADD REPLYlink written 23 months ago by Matt Shirley9.0k

Nice, I'd never noticed the alignment window there before (it's unfortunate that this dataset used NCBI "chromosome names"). I guess I've never been trusting enough of what other people did to want to actually use their alignments...

ADD REPLYlink written 23 months ago by Devon Ryan90k

Maybe you can try aspera.

ADD REPLYlink written 23 months ago by ghostforever.shi50
1
23 months ago by
Philipp Bayer6.1k
Australia/Perth/UWA
Philipp Bayer6.1k wrote:

To download SRA files I always use ascp; there's a manual here.

It's ridiculously fast (the example command has a bandwidth request of 100 Mb/s, but I've used 400 Mb/s before; it depends on your local setup). You can then dump the FASTQ from the downloaded .sra file using the toolkit's fastq-dump --split-3.

ADD COMMENTlink written 23 months ago by Philipp Bayer6.1k

I've been looking for a way to increase the bandwidth beyond the usual 100Mb/s, how did you do it?

ADD REPLYlink written 23 months ago by ATpoint17k
1

Do you have more than 100Mb/s available? Aspera will happily use all the bandwidth it can lay its hands on (up to 10 Gbps) as long as the source supports it (NCBI does).

ADD REPLYlink modified 23 months ago • written 23 months ago by genomax68k
2

I once broke the university's connection to the backbone when downloading many files simultaneously with unthrottled ascp.

ADD REPLYlink written 23 months ago by i.sudbery4.8k

Uni of Sheffield?

ADD REPLYlink written 12 months ago by Kevin Blighe43k

No, Uni where I was a postdoc.

ADD REPLYlink written 12 months ago by i.sudbery4.8k

If you have aspera installed on your system, newer versions of SRA toolkit's prefetch command will automatically do an ascp transfer: https://github.com/ncbi/sra-tools/wiki/HowTo:-Access-SRA-Data

ADD REPLYlink written 23 months ago by Matt Shirley9.0k
1

My biggest problem with this is that for at least fastq data, the rate limiting step is generally the dump rather than the actual download.

ADD REPLYlink written 22 months ago by i.sudbery4.8k

There is nothing you can do about it. If you have access to an SSD, it will speed things up, but fastq-dump will always be slow, especially on GPFS, where the random access slows down the system a lot. If your file system is already slow in the first place, you will have a hard time. See if the data are mirrored at the European Nucleotide Archive (ENA), which also supports Aspera download of FASTQ instead of SRA files.

ADD REPLYlink modified 21 months ago • written 22 months ago by ATpoint17k

I don't think random access is the problem. The SRA format is a column-oriented database, so there should be very little seeking when you're dumping FASTQ. I think the slowdown when dumping SAM format comes from the SAM fields (alignment information) not being retrieved as efficiently as the sequence and quality scores.

ADD REPLYlink written 22 months ago by Matt Shirley9.0k

The solution is to use ENA rather than SRA: everything apart from the controlled-access data is mirrored across, and ENA stores the raw FASTQ, which can be downloaded directly with ascp.
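
As a rough sketch of what that can look like: ENA lays its FASTQ files out in a predictable directory tree, so the ascp source path can be built from the run accession. The directory rule, the era-fasp user, port 33001 and the 300m rate cap below follow ENA's Aspera instructions as I understand them, so treat them as assumptions and check the file report for your run. The key path is a placeholder, the function names are made up for illustration, and the _1 mate suffix assumes a paired-end run. The script only prints the command.

```shell
#!/usr/bin/env bash
# Build the ENA FASTQ directory for a run accession (assumed ENA layout):
#   9-character accession:   vol1/fastq/<first 6 chars>/<accession>
#   longer accessions:       vol1/fastq/<first 6 chars>/<extra digits, zero-padded to 3>/<accession>
ena_fastq_dir() {
    local acc="$1"
    local prefix="${acc:0:6}"       # e.g. SRR333
    local extra_digits="${acc:9}"   # characters beyond the 9th, e.g. 7
    if [ -n "$extra_digits" ]; then
        # 10# forces base 10 so that digits such as 08 are not read as octal
        printf 'vol1/fastq/%s/%03d/%s\n' "$prefix" "$((10#$extra_digits))" "$acc"
    else
        printf 'vol1/fastq/%s/%s\n' "$prefix" "$acc"
    fi
}

# Print (not run) an ascp command for one mate file of a run.
ena_ascp_cmd() {
    local acc="$1" key="$2" mate="$3"
    printf 'ascp -QT -l 300m -P33001 -i %s era-fasp@fasp.sra.ebi.ac.uk:%s/%s_%s.fastq.gz .\n' \
        "$key" "$(ena_fastq_dir "$acc")" "$acc" "$mate"
}

# Placeholder key path; point it at your own Aspera installation.
ena_ascp_cmd SRR3330607 /path/to/aspera/3.6.2/etc/asperaweb_id_dsa.openssh 1
```

Single-end runs usually have no _1/_2 suffix, so adjust the file name accordingly.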

ADD REPLYlink written 22 months ago by i.sudbery4.8k

Is there a solution to this though?

ADD REPLYlink written 21 months ago by nro0
1

You can dump “spots” 1 through n using one process, and n+1 through k using another process. Basically, run fastq-dump on the same SRA archive several times, with each process exporting a different chunk of the FASTQ file. This will scale until you run out of disk IO or CPU threads.
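
A minimal sketch of the chunking idea, assuming you already know the total spot count (it is shown in the SRA Run Browser). The spot_ranges helper, the chunk directory names and the spot count of 10 are made up for illustration; -N/--minSpotId, -X/--maxSpotId and -O/--outdir are the fastq-dump options for the first spot, last spot and output directory. The loop only prints the commands so you can inspect them first.

```shell
#!/usr/bin/env bash
# Print "first last" spot ranges that split spots 1..total into at most n_chunks pieces.
spot_ranges() {
    local total="$1" n_chunks="$2"
    local size=$(( (total + n_chunks - 1) / n_chunks ))   # ceiling division
    local start=1 end
    while (( start <= total )); do
        end=$(( start + size - 1 ))
        if (( end > total )); then
            end=$total
        fi
        echo "$start $end"
        start=$(( end + 1 ))
    done
}

sra_file="SRR2095320.sra"
total_spots=10   # placeholder; use the real spot count of your run

# One fastq-dump per chunk, each writing to its own directory, all in the background.
spot_ranges "$total_spots" 3 | while read -r first last; do
    echo "fastq-dump --split-files -N $first -X $last -O chunk_${first}_${last} $sra_file &"
done
echo "wait"
```

Afterwards the per-chunk FASTQ files need to be concatenated in spot order, separately for each mate, so that R1 and R2 stay in sync.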

ADD REPLYlink written 21 months ago by Matt Shirley9.0k

I never thought of that.... what a great idea!

ADD REPLYlink written 21 months ago by i.sudbery4.8k

Did you check if the data are mirrored at the ENA?

ADD REPLYlink written 21 months ago by ATpoint17k
1
15 months ago by
sutturka150
USA
sutturka150 wrote:

I would like to share my twists and turns with SRA downloads. It took several interactions with NCBI staff to figure this out. Below are the steps:

Prerequisite:
sra-toolkit and the aspera plugin installed. The instructions are specific to a Linux environment.

Steps:

  1. Configure workspace. A cache workspace for downloading the SRA data must be configured. Although SRA-toolkit is installed centrally, this needs to be set manually for every user. Please follow this link and navigate to the section Configuring the Toolkit.

Assuming sra-toolkit is installed or loaded, run the following command and complete setup as mentioned in the link.

vdb-config -i

  2. Download SRA file

prefetch -X 200G SRR2095320 -a "/depot/bioinfo/apps/apps/aspera-connect-3.1.1.70545/bin/ascp|/depot/bioinfo/apps/apps/aspera-connect-3.1.1.70545/etc/asperaweb_id_dsa.putty"

where "-a" specifies the paths to the aspera binary and the private key file, separated by "|". Prefetch will download the SRA data as well as all needed references to your local cache. This avoids sending multiple requests to NCBI servers and saves substantial time.

  3. Demultiplex with sra-toolkit

fastq-dump -I --split-files ./SRR2095320.sra

I found the above approach extremely fast. Using it, the 44 GB SRR2095320.sra file was downloaded (prefetched) and converted to (151 GB R1 + 151 GB R2) of data in about 25 hours using 10 processors. By contrast, the standard fastq-dump on its own had converted only (80 GB R1 + 80 GB R2) after >70 hours, and then the job failed because of a connection issue.
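
Since the original question mentions many samples, the download and conversion steps could be wrapped in a loop along these lines. This is only a sketch: the accession list and the run helper are made up for illustration, it assumes prefetch and fastq-dump are on your PATH and the toolkit is already configured, and it passes the accession (rather than a file path) to fastq-dump so that the toolkit resolves the prefetched copy itself. With DRY_RUN=1 it only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical list; replace with your own run accessions.
accessions=(SRR3330607 SRR2095320)

DRY_RUN=1   # set to 0 to actually execute the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

for acc in "${accessions[@]}"; do
    # Download the run (and any references it needs) into the local cache first...
    run prefetch -X 200G "$acc"
    # ...then convert from the local copy instead of streaming from NCBI.
    run fastq-dump -I --split-files "$acc"
done
```

Add the "-a" option from step 2 to the prefetch line if your prefetch version needs the ascp paths spelled out.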

ADD COMMENTlink modified 15 months ago • written 15 months ago by sutturka150
0
23 months ago by
smho40
Melbourne, Australia
smho40 wrote:

If you are willing to download the FASTQ files instead and run the alignment yourself, here is a nice tutorial for using fastq-dump correctly:

https://edwards.sdsu.edu/research/fastq-dump/

You can even choose to download as FASTA (without quality scores) using the --fasta option to reduce the download volume; however, that is probably not recommended! :-)

ADD COMMENTlink written 23 months ago by smho40
0
7 months ago by
shengwei30
shengwei30 wrote:

Check here for an example using Aspera Connect (ascp):

  1. To download

    prefetch --max-size 100G --transport ascp --ascp-path "/path/to/aspera/3.6.2/bin/ascp|/path/to/aspera/3.6.2/etc/asperaweb_id_dsa.openssh" --output-file $OUTPUT_DIR/$SRA_ID.sra $SRA_ID

  2. To extract

    fastq-dump --split-files --origfmt --gzip $OUTPUT_DIR/$SRA_ID.sra

ADD COMMENTlink written 7 months ago by shengwei30
1

This solution has been posted and upvoted in this thread twice already ;-)

ADD REPLYlink written 7 months ago by ATpoint17k