Good morning everyone,
I'm wondering if you could help me understand what the issue is with my code. I have already sent a ticket to ENA, but I could use a second opinion. This is the first time I have encountered this issue when downloading metadata TSV reports from the ENA Portal API, and it only happens with very large files (over 100 MB): they are saved to TSV before they are fully downloaded, without any error or warning. If I download the same file from the browser, it is fully downloaded, but that's not what I'm looking for. This is the code:
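import os
import requests as rq
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from user_agent import generate_user_agent  # assumed source of generate_user_agent

projectID = "PRJNA43021"
s = rq.Session()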
retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[429, 500, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))
url = f"https://www.ebi.ac.uk/ena/portal/api/filereport?accession={projectID}&result=read_run&fields=study_accession,secondary_study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,submission_accession,tax_id,scientific_name,instrument_platform,instrument_model,library_name,nominal_length,library_layout,library_strategy,library_source,library_selection,read_count,base_count,center_name,first_public,last_updated,experiment_title,study_title,study_alias,experiment_alias,run_alias,fastq_bytes,fastq_md5,fastq_ftp,fastq_aspera,fastq_galaxy,submitted_bytes,submitted_md5,submitted_ftp,submitted_aspera,submitted_galaxy,submitted_format,sra_bytes,sra_md5,sra_ftp,sra_aspera,sra_galaxy,sample_alias,broker_name,sample_title,nominal_sdev,first_created&format=tsv&download=true&limit=0
headers = {"User-Agent": generate_user_agent()}
download = s.get(url, headers=headers, allow_redirects=True)
with open(os.path.join(path, f'{projectID}_experiments-metadata.tsv'), 'wb') as file:
    file.write(download.content)
I have never had similar issues with this code, but this large file is saved before the download is complete. The resulting TSV has only 43,518 rows instead of 69,495, and the last line is truncated like this:
43516 PRJNA46333 SRP002480 SAMN04360132 SRS1312823 SRX1602894 SRR3191631 SRA358002 539655 human skin metagenome ILLUMINA Illumina HiSeq 2000 AJD0187_1_L7161124 500 PAIRED WGS METAGENOMIC RANDOM 4556010 920314020 NISC 02/05/2016 27/06/2016 Illumina HiSeq 2000 paired end sequencing: Illumina Sequencing of MET0782 Metagenomic Paired-end Library Gene-Environment Interactions at the Skin Surface phs000266_29 ILLUMINA_AJD0187_1_L7161124 LA79406 357665999;377470120 d9256526cb9236a55dd06b83a85fe085;01109c3803e0decb4d4b06ad3c4e824a ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/001/SRR3191631/SRR3191631_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/001/SRR3191631/SRR3191631_2.fastq.gz fasp.sra.ebi.ac.uk:/vol1/fastq/SRR319/001/SRR3Read timed out
You can see that at the end it says "Read timed out". Am I doing something wrong? How can I fully download the file and, most importantly, be sure it has been fully downloaded?
Thank you so much
Giulia Soletta
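On the "be sure it's fully downloaded" part: the "Read timed out" text sits inside the saved file itself, which suggests it arrived as part of the response body, so the client may never see an error. One possible sanity check, sketched here with only the standard library, is to confirm that the file ends with a newline and that every row has as many tab-separated fields as the header. This assumes the report has no tabs or line breaks inside a field and that a complete report ends with a newline. The function name `find_truncation_problems` and the `expected_rows` parameter are made up for this illustration.

```python
def find_truncation_problems(tsv_path, expected_rows=None):
    """Return a list of problems that hint at a truncated TSV (empty list = none found).

    Assumes no field contains a tab or a line break, and that a complete
    file ends with a newline.
    """
    problems = []
    with open(tsv_path, "rb") as fh:
        header = fh.readline()
        n_cols = header.rstrip(b"\r\n").count(b"\t") + 1
        n_rows = 0
        last_line = header
        # Data rows start on line 2 of the file (line 1 is the header).
        for line_no, line in enumerate(fh, start=2):
            n_rows += 1
            last_line = line
            n_fields = line.rstrip(b"\r\n").count(b"\t") + 1
            if n_fields != n_cols:
                problems.append(
                    f"line {line_no}: {n_fields} fields, expected {n_cols}"
                )
    if not last_line.endswith(b"\n"):
        problems.append("file does not end with a newline")
    if expected_rows is not None and n_rows != expected_rows:
        problems.append(f"{n_rows} data rows, expected {expected_rows}")
    return problems
```

A file cut off mid-row like the one above would be reported twice: once for the short last row and once for the missing final newline. If the expected number of rows is known from another source, it can be passed as `expected_rows`.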
Are you missing a " at the end of the URL? Doing [...] I got 69498 rows.
The missing " is a copy-paste error in the post. I know wget, but my tool is written entirely in Python. I really don't understand why this happens with requests! Thank you anyway :)
While there is no timeout by default on the get operation, you may want to set a higher value and see if that helps, since there is a large amount of data being downloaded.
Thank you so much for your hint! I'll look into this!
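To make the timeout suggestion concrete, here is a minimal sketch of a streamed download with an explicit timeout, not a confirmed fix for this particular report. It assumes `session` is a `requests.Session` (such as `s` in the question). The helper name `download_to_file`, the `.part` suffix, and the default timeout values are choices made for this illustration. In requests, a `(connect, read)` timeout tuple limits the wait for each read from the socket, not the total download time.

```python
import os


def download_to_file(session, url, dest, headers=None,
                     timeout=(10, 300), chunk_size=1024 * 1024):
    """Stream `url` to `dest` through a requests-style session.

    The body is written to a temporary ".part" file, which is renamed to
    `dest` only if the transfer finishes without raising. Returns the
    number of bytes written.
    """
    tmp_path = dest + ".part"
    written = 0
    # stream=True avoids holding the whole report in memory at once.
    with session.get(url, headers=headers, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()  # fail loudly on 4xx/5xx responses
        with open(tmp_path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
                written += len(chunk)
    # Reached only when no exception interrupted the transfer.
    os.replace(tmp_path, dest)
    return written
```

With the names from the question, this could be called as `download_to_file(s, url, os.path.join(path, f'{projectID}_experiments-metadata.tsv'), headers=headers)`. If the server ends the response early but cleanly, the client sees no error, so it is still worth checking the saved file afterwards.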