Truncated metadata file report from ENA Portal API
10 days ago
Giulia • 0

Good morning everyone,

I'm wondering if you could help me understand what the issue is with my code. I have already sent a ticket to ENA, but I could use a second opinion. This is the first time I have run into this problem when downloading metadata TSV reports from the ENA Portal API, and it only happens with very large files (over 100 MB): they are saved to TSV before they are fully downloaded, without any error or warning. If I download the same file from the browser it is complete, but that is not what I'm looking for. This is the code:

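import os

import requests as rq
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from user_agent import generate_user_agent  # assumed source of generate_user_agent

projectID = "PRJNA43021"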
s = rq.session()
retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[429, 500, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))

url = f"https://www.ebi.ac.uk/ena/portal/api/filereport?accession={projectID}&result=read_run&fields=study_accession,secondary_study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,submission_accession,tax_id,scientific_name,instrument_platform,instrument_model,library_name,nominal_length,library_layout,library_strategy,library_source,library_selection,read_count,base_count,center_name,first_public,last_updated,experiment_title,study_title,study_alias,experiment_alias,run_alias,fastq_bytes,fastq_md5,fastq_ftp,fastq_aspera,fastq_galaxy,submitted_bytes,submitted_md5,submitted_ftp,submitted_aspera,submitted_galaxy,submitted_format,sra_bytes,sra_md5,sra_ftp,sra_aspera,sra_galaxy,sample_alias,broker_name,sample_title,nominal_sdev,first_created&format=tsv&download=true&limit=0

headers = {"User-Agent": generate_user_agent()}
download = s.get(url, headers=headers, allow_redirects=True)
with open((os.path.join(path, f'{projectID}_experiments-metadata.tsv')), 'wb') as file:
    file.write(download.content)

I have never had similar issues with this code, but this large file is saved before the download is complete. The resulting TSV has only 43,518 rows instead of 69,495, and the last line is truncated like this:

43516   PRJNA46333  SRP002480   SAMN04360132    SRS1312823  SRX1602894  SRR3191631  SRA358002   539655  human skin metagenome   ILLUMINA    Illumina HiSeq 2000 AJD0187_1_L7161124  500 PAIRED  WGS METAGENOMIC RANDOM  4556010 920314020   NISC    02/05/2016  27/06/2016  Illumina HiSeq 2000 paired end sequencing: Illumina Sequencing of MET0782 Metagenomic Paired-end Library    Gene-Environment Interactions at the Skin Surface   phs000266_29    ILLUMINA_AJD0187_1_L7161124 LA79406 357665999;377470120 d9256526cb9236a55dd06b83a85fe085;01109c3803e0decb4d4b06ad3c4e824a   ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/001/SRR3191631/SRR3191631_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/001/SRR3191631/SRR3191631_2.fastq.gz   fasp.sra.ebi.ac.uk:/vol1/fastq/SRR319/001/SRR3Read timed out

You can see that the last line ends with "Read timed out". Am I doing something wrong? How can I download the whole file and, most importantly, be sure that it is complete?

Thank you so much

Giulia Soletta

ena python • 336 views

Are you missing a " at the end of the URL? Running

wget -O test.out "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJNA43021&result=read_run&fields=study_accession,secondary_study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,submission_accession,tax_id,scientific_name,instrument_platform,instrument_model,library_name,nominal_length,library_layout,library_strategy,library_source,library_selection,read_count,base_count,center_name,first_public,last_updated,experiment_title,study_title,study_alias,experiment_alias,run_alias,fastq_bytes,fastq_md5,fastq_ftp,fastq_aspera,fastq_galaxy,submitted_bytes,submitted_md5,submitted_ftp,submitted_aspera,submitted_galaxy,submitted_format,sra_bytes,sra_md5,sra_ftp,sra_aspera,sra_galaxy,sample_alias,broker_name,sample_title,nominal_sdev,first_created&format=tsv&download=true&limit=0"

I got 69498 rows.
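The row count above can also be checked from Python, which speaks to the question of how to be sure a report is complete. What follows is only a minimal sketch: `inspect_tsv` is a hypothetical helper name, and it assumes that a complete report ends with a newline and that no field contains a literal tab. It counts the data rows and flags any line whose number of fields differs from the header, which is how a line cut off mid-record would usually show up.

```python
def inspect_tsv(path):
    """Return (data_rows, problems) for a tab-separated report with a header line."""
    problems = []
    rows = 0
    with open(path, "rb") as fh:
        header = fh.readline()
        n_cols = header.count(b"\t") + 1
        last = header
        for line_no, line in enumerate(fh, start=2):
            rows += 1
            last = line
            # A line cut off mid-record usually has fewer fields than the header.
            if line.count(b"\t") + 1 != n_cols:
                problems.append(f"line {line_no}: expected {n_cols} fields")
    # Assumption: a complete report ends with a newline character.
    if not last.endswith(b"\n"):
        problems.append("file does not end with a newline")
    return rows, problems
```

An empty `problems` list does not prove the file is complete (a download could stop exactly at a line boundary), so comparing the row count against an independent download, as done with wget here, remains a useful cross-check.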


The missing " is just a copy-paste error in the post. I know wget, but my tool is written entirely in Python. I really don't understand why this happens with requests! Thank you anyway :)


Although requests applies no timeout to get() by default, you may want to set an explicit, generous value and see if that helps, since a large amount of data is being downloaded.
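One way to try this suggestion is to pass an explicit `timeout` and stream the body to disk instead of holding it all in `download.content`. The sketch below is an illustration, not a confirmed fix: `stream_to_file`, the `.part` suffix and the timeout values are assumptions made for the example, and `session` is expected to be a `requests.Session` such as `s` in the question. The `stream=True` and `timeout=(connect, read)` arguments, `raise_for_status()` and `iter_content()` are standard requests features. Note that "Read timed out" appears inside the saved file, which suggests the message may have been written by the server; if so, a client-side timeout alone may not be enough.

```python
import os


def stream_to_file(session, url, dest, headers=None, timeout=(10, 600),
                   chunk_size=1 << 20):
    """Stream url to dest through a requests-style session; return bytes written."""
    tmp = dest + ".part"  # hypothetical suffix marking an unfinished download
    written = 0
    # timeout=(connect, read) in seconds; the values here are illustrative only.
    with session.get(url, headers=headers, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()  # fail loudly on 4xx/5xx responses
        with open(tmp, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
                written += len(chunk)
    # Reached only if no exception was raised while reading the body.
    os.replace(tmp, dest)
    return written
```

With the names from the question, the call would look like `stream_to_file(s, url, os.path.join(path, f'{projectID}_experiments-metadata.tsv'), headers=headers)`. Because the final file name only appears after the whole body has been read, an interrupted transfer leaves a `.part` file behind rather than a TSV that looks finished.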


Thank you so much for your hint! I'll look into this!
