I came across the NCBI website below with instructions for accessing the SRA dataset through AWS, but it seems the AMI they reference no longer exists. Does anyone have a lead on how the SRA dataset can be accessed via AWS or, more specifically, how one might access data held in S3 buckets?
Here is what I got from SRA about a week ago. In short, you'll likely need an AWS or GCP account and you may need to pay download costs unless you are using the data in the cloud. Full documentation is apparently being prepared.
The SRA Toolkit is needed to read ETL data, and its default configuration lets it find and retrieve SRA runs by accession.
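As a rough illustration of retrieving a run by accession, here is a minimal Python sketch that wraps the toolkit's `prefetch` and `fasterq-dump` commands. It assumes the SRA Toolkit is installed and on your PATH with its default configuration; the function names and the `fastq` output directory are my own choices for the example, not anything from SRA.

```python
import shutil
import subprocess


def prefetch_cmd(accession):
    """Command line that downloads one run (e.g. SRR000001) by accession."""
    return ["prefetch", accession]


def fasterq_dump_cmd(accession, out_dir="fastq"):
    """Command line that converts a downloaded run to FASTQ in out_dir."""
    return ["fasterq-dump", "--outdir", out_dir, accession]


def fetch_run(accession, out_dir="fastq"):
    """Download a run and convert it to FASTQ (hypothetical helper)."""
    # Fail early with a clear message if the toolkit is not installed.
    for tool in ("prefetch", "fasterq-dump"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH; install the SRA Toolkit")
    subprocess.run(prefetch_cmd(accession), check=True)
    subprocess.run(fasterq_dump_cmd(accession, out_dir), check=True)
```

Calling `fetch_run("SRR000001")` would run both steps in turn. Where the data actually comes from (NCBI, AWS, or GCP) depends on how the toolkit is configured.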
You can use the SRA Run Selector with Study, Sample, or Experiment accessions or an Entrez search to select a list of interesting SRA runs.
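Once you have a list of runs, you will probably want to feed it to a script. The sketch below assumes you have saved the selected runs as plain text with one accession per line (that file layout is my assumption), and it keeps only strings that look like run accessions (SRR, ERR, or DRR followed by digits). `parse_accession_list` is a hypothetical helper name.

```python
import re

# Run accessions start with SRR (NCBI), ERR (ENA), or DRR (DDBJ).
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def parse_accession_list(text):
    """Return unique run accessions from text, in order of first appearance."""
    runs = []
    for line in text.splitlines():
        accession = line.strip()
        # Skip blank lines, headers, and anything that is not a run accession.
        if RUN_ACCESSION.fullmatch(accession) and accession not in runs:
            runs.append(accession)
    return runs
```

For example, `parse_accession_list(open("runs.txt").read())` would give you a list to loop over, where `runs.txt` is whatever file you saved.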
Many files are also available on Google Cloud Platform (GCP) or Amazon Web Services (AWS), but you may need an account with that provider to access them, and you may have to pay egress charges to move the data outside of that provider's platform.
The Free Egress column describes where the data can be accessed without an egress charge:
Worldwide - This data can be downloaded from anywhere without paying an egress fee.
s3.us-east-1 - This data is free to access for machines running in Amazon's us-east-1 region; all other regions, or transfer outside of Amazon, will require paying egress charges.
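To make the Free Egress rule concrete, here is a small Python sketch that decides whether a download would be free given where your machine runs. It only knows the two values described above; reading `s3.us-east-1` as "service prefix, dot, region" and mapping `s3` to AWS is my interpretation, and `egress_is_free` is a hypothetical name. Other prefixes would need to be added to the mapping.

```python
# Assumed mapping from the service prefix in the Free Egress value to a provider.
SERVICE_TO_PROVIDER = {"s3": "aws"}


def egress_is_free(free_egress, provider=None, region=None):
    """Return True if data with this Free Egress value can be read at no charge.

    provider/region describe where the reading machine runs; leave them as
    None for a machine outside any cloud.
    """
    value = free_egress.strip().lower()
    if value == "worldwide":
        return True
    if provider is None or region is None:
        # Outside the cloud, anything that is not Worldwide incurs egress.
        return False
    service, _, free_region = value.partition(".")
    return (
        SERVICE_TO_PROVIDER.get(service) == provider.lower()
        and free_region == region.lower()
    )
```

So `egress_is_free("s3.us-east-1", "aws", "us-east-1")` is true, while the same value read from another AWS region or from a laptop is not.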
Access Type describes whether a user account is necessary for data access or if the data can be accessed anonymously.
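For the S3 part of the question: an S3 object that allows anonymous reads can be fetched over plain HTTPS with no credentials. The sketch below builds the standard virtual-hosted-style S3 URL and downloads it with the Python standard library. The bucket name and key you pass in are placeholders you must fill in from the file's listed location (I am not assuming any particular SRA bucket here), and this will not work for files whose Access Type requires an account.

```python
import urllib.parse
import urllib.request


def s3_https_url(bucket, key, region="us-east-1"):
    """Virtual-hosted-style HTTPS URL for an object in an S3 bucket."""
    # quote() keeps "/" separators and escapes other unsafe characters.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{urllib.parse.quote(key)}"


def download_anonymous(bucket, key, dest, region="us-east-1"):
    """Download a publicly readable object to dest (hypothetical helper)."""
    # Raises urllib.error.HTTPError (e.g. 403) if anonymous reads are not allowed.
    urllib.request.urlretrieve(s3_https_url(bucket, key, region), dest)
    return dest
```

Keep in mind the egress point above: running this outside the bucket's region may cost the data owner or you, depending on how the bucket is set up.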
Primary ETL - The file format that has been traditionally distributed from SRA and used by the SRA Toolkit to read or output into formats like FASTQ, SAM, etc. This data is normalized during the extract, transform, and load (ETL) process at SRA.
Original - The source data that was submitted to SRA and has not gone through the ETL process. These files may require specific software to open and read.
Analysis (previously called Secondary ETL) - These files are a further analysis of the data available in the run, but may not be present for all runs. May include items like realignments, wgMLST, VCF, etc.