Data management duties have lapsed in my lab. I'm trying to identify files in our systems that have been published to SRA.
I have a hash value for every file in our system. If I can download the exact file that was uploaded to SRA, I can compute its hash and cross-check for duplicates. However, when files are uploaded to SRA, they are converted into SRA objects, from which you extract the sequence using the SRA Toolkit's fastq-dump.
Downloading the fastq using this tool yields a file of a different size from the originally uploaded file.
Are there certain command line options for fastq-dump I can specify to regenerate the exact file that was uploaded?
Thanks.
I saw those (I was using them for the size estimates I mentioned). I didn't think they would be available for download via the S3 address, but I'll try it.
Thanks again.
Edit: following up
You cannot download the original files from SRA's S3 bucket directly via the AWS CLI. You have to use SRA's cloud delivery service instead.
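For anyone doing the same cross-check: once the original files have been delivered and copied to local disk, the comparison itself could look roughly like the sketch below. It assumes the lab's hashes are MD5 (swap the algorithm name if yours are something else), and the function names, the `known_hashes` mapping, and the directory layout are all hypothetical, not anything SRA provides.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large sequencing files never sit fully in memory."""
    h = hashlib.new(algorithm)  # assumption: MD5; use whatever your lab inventory uses
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_published(download_dir, known_hashes, algorithm="md5"):
    """Map each downloaded file to the lab file with the same hash.

    known_hashes is a hypothetical dict of {hex digest: path in the lab system}.
    Files whose digest is not in known_hashes are left out of the result.
    """
    matches = {}
    for path in sorted(Path(download_dir).rglob("*")):
        if path.is_file():
            digest = file_digest(path, algorithm)
            if digest in known_hashes:
                matches[str(path)] = known_hashes[digest]
    return matches
```

Calling `find_published("delivered/", known_hashes)` returns only the delivered files that match something in the lab inventory. This only works on the originally submitted files; output from fastq-dump will not hash to the same value, as noted in the question.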