Question: NCBI SRA AWS AMI
3
14 months ago by
joe140
joe140 wrote:

I came across the below NCBI website with instructions to access the SRA dataset through AWS, but it seems the AMI they reference no longer exists. Anyone have a lead for how the SRA dataset can be accessed via AWS, or more specifically, how one might access data held on s3 buckets?

https://www.ncbi.nlm.nih.gov/sra/docs/sra-aws-download/

s3 aws sra ncbi • 1.8k views
ADD COMMENTlink modified 13 months ago by hermidalc0 • written 14 months ago by joe140
1

While someone may respond, I suggest that you send an official ticket to NCBI using this form. Use the "Write to the Help Desk" button on the right. Please update this thread when you hear back from them.

I sent a ticket in to see if SRA Google/Amazon bucket links were available for public use. Will post an update in the other thread you have.

ADD REPLYlink modified 14 months ago • written 14 months ago by genomax91k
1

NCBI SRA support indicated that the cloud services are not ready for public use (as of early August 2019). Some data downloads will require payment, some will not. A public announcement about the cloud services is expected in the near future.

ADD REPLYlink written 14 months ago by genomax91k

Thanks, I'm still bouncing emails back and forth with NLM. They were confused about why their AMI wasn't available, but it seems like it's just a matter of days/weeks until everything is up and running.

ADD REPLYlink written 14 months ago by joe140

Maybe I'm doing something wrong, but I've configured the AWS CLI on my local computer with the appropriate us-east-1 region and tried to download an SRA file:

aws s3 cp s3://sra-pub-run-3/SRR292241/SRR292241.3 SRR292241.sra

It says it's forbidden (403). From this thread it seems to me that easy downloading from AWS is something they want to provide, so I should be allowed to download without needing to spin up or log into an EC2 instance. I just want a better/faster way to get .sra files. Using prefetch can be problematic, e.g. download failures, long pauses in the download, and very slow/erratic download speeds.
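
For reference, a minimal sketch of rebuilding the S3 URI used in the command above and printing an anonymous-access variant of it. The helper name `sra_s3_uri` is hypothetical, and the bucket/accession/accession.suffix layout is inferred only from that one example. `--no-sign-request` is a standard AWS CLI option for unsigned requests; whether it avoids the 403 depends on whether the bucket allows anonymous reads, which this thread does not establish.

```shell
# Hypothetical helper: rebuild the S3 URI pattern seen in the command above.
# Arguments: bucket, run accession, numeric suffix ("3" in the example).
sra_s3_uri() {
    printf 's3://%s/%s/%s.%s\n' "$1" "$2" "$2" "$3"
}

uri=$(sra_s3_uri sra-pub-run-3 SRR292241 3)

# Print, rather than run, an anonymous-access variant of the copy command.
# --no-sign-request tells the AWS CLI not to sign the request with credentials;
# it only helps if the bucket permits anonymous reads (an assumption here).
echo "aws s3 cp --no-sign-request $uri SRR292241.sra"
```

The command is only echoed so it can be inspected before anything is transferred.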

ADD REPLYlink written 13 months ago by hermidalc0
1

Do yourself a favor and download directly in fastq format from ENA: Fast download of FASTQ files from the European Nucleotide Archive (ENA)

(only works for non-restricted data, most datasets are non-restricted)

ADD REPLYlink written 13 months ago by ATpoint40k

ATpoint do you have any advice on my post on that thread? A: Fast download of FASTQ files from the European Nucleotide Archive (ENA)

ADD REPLYlink modified 13 months ago • written 13 months ago by hermidalc0
1

hermidalc: Per conversation with NCBI support, AWS/Google buckets are NOT available for public use yet (Sept 2019), even though the links have started appearing in SRA records.

They are supposed to go live later this year after an announcement is made by NCBI.

ADD REPLYlink modified 13 months ago • written 13 months ago by genomax91k
3
14 months ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

Here is what I got from SRA about a week ago. In short, you'll likely need an AWS or GCP account and you may need to pay download costs unless you are using the data in the cloud. Full documentation is apparently being prepared.

The SRA Toolkit is needed for ETL data and the default toolkit configuration enables it to find and retrieve SRA runs by accession.

You can use the SRA Run Selector with Study, Sample, or Experiment accessions or an Entrez search to select a list of interesting SRA runs.

Many files are also available in either the Google Cloud Platform (GCP) or Amazon Web Services (AWS) but may require the user to have an account with that provider to access the files and pay egress charges to access the data outside of that cloud provider's platform.

The Free Egress column describes where the data can be accessed without an egress charge.

Worldwide - This data can be downloaded from anywhere without paying an egress fee.

s3.us-east-1 - This data is free to access for machines running in Amazon's us-east-1 region; all other regions or transport outside of Amazon will require paying egress charges.

Access Type describes whether a user account is necessary for data access or if the data can be accessed anonymously.

Primary ETL - The file format that has been traditionally distributed from SRA and used by the SRA Toolkit to read or output into formats like FASTQ, SAM, etc. This data is normalized during the extract, transform, and load (ETL) process at SRA.

Original - The source data that was submitted to SRA and has not gone through the ETL process. These files may require specific software to open and read.

Analysis (previously called Secondary ETL) - These files are a further analysis of the data available in the run, but may not be present for all runs. May include items like realignments, wgMLST, VCF, etc.
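
The Free Egress rule described above can be sketched as a small check. This is only an illustration: the function name `is_free_egress` is hypothetical, and the only values it understands are the two quoted in the message ("Worldwide" and "s3.<region>").

```shell
# Hypothetical helper: succeed (exit 0) if data with the given Free Egress
# value can be read from the given AWS region without an egress charge.
# Values are taken from the description above: "Worldwide" or "s3.<region>".
is_free_egress() {
    free_egress=$1
    my_region=$2
    case "$free_egress" in
        Worldwide) return 0 ;;
        s3.*)      [ "${free_egress#s3.}" = "$my_region" ] ;;
        *)         return 1 ;;
    esac
}

# Example: data marked s3.us-east-1, read from a machine in us-west-2.
if is_free_egress "s3.us-east-1" "us-west-2"; then
    echo "no egress charge"
else
    echo "egress charge applies"
fi
```

For a machine outside AWS, pass an empty region; only "Worldwide" data then counts as free.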

ADD COMMENTlink written 14 months ago by Sean Davis26k
1

you'll likely need an AWS or GCP account and you may need to pay download costs unless you are using the data in the cloud

That will lock out a number of users. I assume some form of free access will remain (e.g. sratoolkit); otherwise we will need to depend on ENA to provide fastq data.

ADD REPLYlink written 14 months ago by genomax91k

I believe that SRA toolkit will remain as an option, at least for the foreseeable future. I know of no SRA datasets that are available in the cloud only as of now.

ADD REPLYlink written 14 months ago by Sean Davis26k

I'm having an issue with getting some fastqs that are only available in the cloud. If you go to https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10620024 and click on the Data access tab, there are some AWS S3 links, in a section titled Original format, to the files that I want.

There are problems with the files downloadable by fastq-dump. SRP235541 is the accession and it is annotated as being paired end, but using fasterq-dump results in a single fastq file per run.

ADD REPLYlink modified 27 days ago • written 27 days ago by mcsimenc10
1

You can download this data file and then save it with a .sra extension. Then you can dump the reads out using:

fastq-dump -F --split-files SRR10620024.sra

I got the expected 3 files with 8, 26, and 98 bp reads.

::::::::::::::
SRR10620024_1.fastq
::::::::::::::
@K00335:298:HYCLMBBXX:7:1101:1550:1543
TCAGCCGT
+K00335:298:HYCLMBBXX:7:1101:1550:1543
A--AF-AA

::::::::::::::
SRR10620024_2.fastq
::::::::::::::
@K00335:298:HYCLMBBXX:7:1101:1550:1543
CAACCTCCAAAGGCGTAACTTTACAA
+K00335:298:HYCLMBBXX:7:1101:1550:1543
AAAFFJJJJJ<7<A<F7JJFJJJA<A

::::::::::::::
SRR10620024_3.fastq
::::::::::::::
@K00335:298:HYCLMBBXX:7:1101:1550:1543
GATNGCAGAATATGGAGTCATTATTAGAGACTAAGACGCTATGTATAGATGCACAAAGGATGGAGTCGCTCTGGTCTACACAAAGGTAAGAATTTTCC
+K00335:298:HYCLMBBXX:7:1101:1550:1543
A7-#7FA-AF--77-<7-----7A77----<AAJ-FJ----7-77--7----<-77---7<-AJ7-----7A----7-77AA----7---7---77--
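
To confirm the read lengths without paging through the files, something like the sketch below could be used. The helper name `first_read_length` is hypothetical, and it only looks at the first record of each file, so it assumes all reads in a file have the same length.

```shell
# Hypothetical helper: print the length of the first read in a FASTQ file.
# Line 2 of a FASTQ record holds the sequence.
first_read_length() {
    awk 'NR == 2 { print length($0); exit }' "$1"
}

# Report the first read length for each split file that exists.
for f in SRR10620024_1.fastq SRR10620024_2.fastq SRR10620024_3.fastq; do
    if [ -f "$f" ]; then
        echo "$f: $(first_read_length "$f") bp"
    fi
done
```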
ADD REPLYlink modified 27 days ago • written 27 days ago by genomax91k

Thanks! I guess I'll need some patience. Here are two interesting documents/webpages I found while searching for an answer.

https://www.nlm.nih.gov/news/NLM_Moves_SRA_Cloud.html

https://datascience.nih.gov/sites/default/files/NIH_Strategic_Plan_for_Data_Science_Final_508.pdf

ADD REPLYlink written 14 months ago by joe140