Question

Selecting query format for repeated calls to NCBI's API

8 months ago

LauferVA 4.5k

Hi Biostars,

I need to make some fairly extensive calls for RNA-Seq records in SRA, then annotate them with metadata, and I'd like to do this the most efficient way. I'll provide some relevant background information first, then try to ask a specific question relating to strategic query design...

Background:

At present, I am most interested in project-level metadata. For each SRA record, I would like to obtain: the BioProject accession; the BioSample accession; the Center ID (e.g. GSE___ or what have you); the name, title, and description of the project; the organism name and, if relevant, the organism strain; and sequencing info, including the technology type (Illumina, Ion Torrent), the sequencer type (HiSeq 2500, NovaSeq 6000, etc.), and library info (single-end or paired-end, and what is targeted, e.g. the poly(A) tail).

To do this efficiently without wasting NCBI's resources, I have been studying online documentation of NCBI's API as well as the (extensive!) capabilities of esearch, elink, and efetch. I have also looked into NCBI's API query rate and max record limits (https://www.ncbi.nlm.nih.gov/books/NBK25497/, "Minimizing Number of Requests"), as well as recommendations for implementation (https://www.ncbi.nlm.nih.gov/books/NBK25498, "Application 3: Retrieving Large Datasets").

I do NOT have questions about: getting an API key, batch queries, or running queries concurrently without exceeding API call thresholds - I feel OK about implementing that.

Question:

At this time, I have 2 working query formats, but I am having trouble evaluating which is likely to be more performant for a large number of records. Part of this is due to differences in XML/JSON/TXT output (I am not sure when you can receive each, to be honest), and part relates to a more general conceptual question: do you have recommendations for which approach below will have better complexity and data retrieval speed?

Option A - Starting with SRA: Querying SRA directly seems intuitive because RNA-seq samples are stored in this database. SRA entries are linked to BioSample and BioProject databases - from which metadata can be obtained. I think this approach could be more efficient if the SRA database's schema readily provides links to the specific BioProject and BioSample IDs needed, along with the other associated metadata.

Option B - Starting with BioProject: Given that all the information sought ties back to BioProjects, initiating queries with BioProject IDs could streamline the process of gathering associated samples and metadata. BioProjects are designed to organize research data across various databases, including SRA, making them a logical starting point for a structured query plan. However, this approach may involve additional steps to then retrieve specific SRA records linked to each BioProject.
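For concreteness, the two options can be written out as E-utilities request sequences. This is only a sketch that builds URLs and makes no network calls. The `WEBENV` placeholder, the `query_key` values, and the example search terms are assumptions; a real run would take the history values from each preceding response.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
WEBENV = "WEBENV"  # placeholder: the real value comes back from esearch


def url(endpoint, **params):
    """Build one E-utilities request URL."""
    return f"{EUTILS}/{endpoint}.fcgi?{urlencode(params)}"


def plan_option_a(term):
    """Option A: search SRA, pull SRA summaries, then follow links outward."""
    return [
        url("esearch", db="sra", term=term, usehistory="y"),
        url("esummary", db="sra", query_key=1, WebEnv=WEBENV),
        url("elink", dbfrom="sra", db="bioproject", query_key=1, WebEnv=WEBENV),
        url("elink", dbfrom="sra", db="biosample", query_key=1, WebEnv=WEBENV),
    ]


def plan_option_b(term):
    """Option B: search BioProject, link down to SRA, then pull SRA summaries."""
    return [
        url("esearch", db="bioproject", term=term, usehistory="y"),
        # neighbor_history stores the linked SRA UIDs on the history server;
        # query_key=2 below assumes that is the key returned for them.
        url("elink", dbfrom="bioproject", db="sra", query_key=1,
            WebEnv=WEBENV, cmd="neighbor_history"),
        url("esummary", db="sra", query_key=2, WebEnv=WEBENV),
    ]


plan_a = plan_option_a('"rna seq"[Strategy]')
plan_b = plan_option_b("transcriptome")
```

Written this way, the difference between the options is mostly where the link step sits: Option A links outward from each batch of SRA records, while Option B links once per batch of projects before any SRA summaries are requested.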

Why not just solve this by looking at API documentation?

Generally, in such cases, I would consult API documentation to understand the structure of responses and the relationships between databases, then identify the most efficient querying strategy. There are a few blockers, though.

One of the major problems here is that I don't know to what extent NCBI supports returning these data as XML, JSON, or TXT (under what circumstances, and why).
Aside from this, starting with SRA seems more direct, whereas Option B (BioProject) could require fewer API calls overall but then take more traversal time to drill down to individual sample metadata in a batched query ...

Any thoughts or recommendations?

NCBI esummary elink API efetch • 1.1k views

ADD COMMENT • link updated 7 months ago by GenoMax 147k • written 8 months ago by LauferVA 4.5k

8 months ago

GenoMax 147k

NCBI makes metadata for all submissions available here: https://ftp.ncbi.nih.gov/sra/reports/Metadata/

Since you probably have local compute you can use, download the files and parse as needed.
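As a sketch of the download-and-parse route: assuming that directory includes a tab-delimited accessions table (I believe `SRA_Accessions.tab` is one such file) with `Accession`, `Status`, `Type`, `BioSample`, and `BioProject` columns, it can be streamed row by row without loading it into memory. The column names and the sample rows below are assumptions for illustration, so check them against the header of the file you actually download.

```python
import csv
import io


def live_runs(handle):
    """Yield (run, biosample, bioproject) for live RUN rows of a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row.get("Type") == "RUN" and row.get("Status") == "live":
            yield row["Accession"], row["BioSample"], row["BioProject"]


# Made-up rows in the assumed layout; real use would pass an open file handle,
# e.g. open("SRA_Accessions.tab", newline="").
sample = io.StringIO(
    "Accession\tStatus\tType\tBioSample\tBioProject\n"
    "SRR0000001\tlive\tRUN\tSAMN00000001\tPRJNA000001\n"
    "SRX0000001\tlive\tEXPERIMENT\tSAMN00000001\tPRJNA000001\n"
    "SRR0000002\tsuppressed\tRUN\tSAMN00000002\tPRJNA000002\n"
)
rows = list(live_runs(sample))
```

The resulting accession list can then seed whatever later API calls are still needed.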

ADD COMMENT • link 8 months ago by GenoMax 147k

Hi Geno, this is definitely helpful. The thing is, though, I don't think all the metadata I need will necessarily be in SRA's metadata; even if it is at present, it won't be for the next set of queries.

this question was mostly about knowing how to structure the API calls.

however this may be of use in figuring out the JSON/XML/TXT issue. Thank you!

VL

ADD REPLY • link 8 months ago by LauferVA 4.5k

I need to make some fairly extensive calls for RNA-Seq records

You should not be doing any API calls if you want to interrogate the entire SRA for the initial selection. These files should have enough for you to create a list that you can use for the later calls (If you need any additional info).

ADD REPLY • link 8 months ago by GenoMax 147k

I'm already mostly done, and I think it's with a minimum of outlay to NCBI. Thanks for your input, but this approach is fine and is effectively already finished.

VAL

ADD REPLY • link 8 months ago by LauferVA 4.5k

GenoMax, I was just snooping around on the FTP site you linked. It doesn't look like there is a similar flat file for GEO. Is that your understanding?

SRA, BioSample, and BioProject do have them, but GEO appears to be different (dbGaP as well, I think).

ADD REPLY • link 7 months ago by LauferVA 4.5k

Try here: https://ftp.ncbi.nih.gov/geo/datasets/

soft/
    GDSxxx.soft.gz
        gzipped SOFT files by DataSet (GDS)
        GEO DataSets (GDS) are curated sets of comparable GEO Sample (GSM) data.
        GDS data tables contain VALUE measurements extracted from original Sample
        records.

SOFT (Simple Omnibus in Text Format) is a compact, simple, line-based,
ASCII text format that incorporates experimental data and metadata.

Example: https://ftp.ncbi.nih.gov/geo/datasets/GDS1nnn/GDS1027/soft/GDS1027.soft.gz
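Because SOFT is line-based, the metadata portion is easy to read directly. A minimal sketch, assuming the usual SOFT line prefixes (`^` opens an entity, `!` carries a `key = value` attribute); the sample lines and attribute values below are made up, and data-table rows are simply ignored.

```python
import gzip
from collections import defaultdict


def parse_soft_header(lines):
    """Collect '!key = value' attributes under each '^ENTITY = id' line."""
    entities = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("^"):
            kind, _, ident = line[1:].partition(" = ")
            current = entities.setdefault((kind, ident), defaultdict(list))
        elif line.startswith("!") and current is not None:
            key, _, value = line[1:].partition(" = ")
            current[key].append(value)  # repeated keys keep every value
    return entities


def parse_soft_gz(path):
    """Parse a gzipped SOFT file such as a downloaded GDSxxx.soft.gz."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return parse_soft_header(handle)


# Made-up lines in SOFT style, for illustration only.
SAMPLE = [
    "^DATASET = GDS0000\n",
    "!dataset_title = Example title\n",
    "!dataset_platform_organism = Homo sapiens\n",
    "ID_REF\tIDENTIFIER\tGSM000001\n",
]
parsed = parse_soft_header(SAMPLE)
```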

ADD REPLY • link 7 months ago by GenoMax 147k

score 1 · Accepted Answer · 2024-02-27

The ObtainEntrezMetadata.py3 script at this github repository will pull approximately 1M records every 10 minutes, parse any XML or other native data, and print to a tab-delimited format.
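To illustrate the XML-to-tab-delimited step (this is not the repository's code): a minimal sketch that flattens one SRA-style document summary fragment into a TSV row. The fragment is made up, and its element names reflect my understanding of the `ExpXml` field in SRA summaries, so treat both as assumptions. The fragment has several sibling elements and no single root, so it is wrapped before parsing.

```python
import xml.etree.ElementTree as ET

# Made-up fragment; element names are assumed, not taken from the script.
EXPXML = (
    '<Summary><Title>Example RNA-Seq run</Title>'
    '<Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform></Summary>'
    '<Organism taxid="9606" ScientificName="Homo sapiens"/>'
    '<Library_descriptor><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY>'
    '<LIBRARY_LAYOUT><PAIRED/></LIBRARY_LAYOUT></Library_descriptor>'
    '<Bioproject>PRJNA000001</Bioproject><Biosample>SAMN00000001</Biosample>'
)

FIELDS = ("bioproject", "biosample", "organism", "platform", "model", "strategy", "layout")


def parse_expxml(fragment):
    """Flatten one ExpXml-style fragment into a dict of metadata fields."""
    root = ET.fromstring(f"<root>{fragment}</root>")  # wrap: fragment has no root
    platform = root.find("Summary/Platform")
    organism = root.find("Organism")
    layout = root.find("Library_descriptor/LIBRARY_LAYOUT")
    return {
        "bioproject": root.findtext("Bioproject", default=""),
        "biosample": root.findtext("Biosample", default=""),
        "organism": organism.get("ScientificName", "") if organism is not None else "",
        "platform": (platform.text or "") if platform is not None else "",
        "model": platform.get("instrument_model", "") if platform is not None else "",
        "strategy": root.findtext("Library_descriptor/LIBRARY_STRATEGY", default=""),
        # The layout is encoded as the tag of the first child (e.g. PAIRED).
        "layout": layout[0].tag if layout is not None and len(layout) else "",
    }


def to_tsv_row(record):
    """Join the fields in a fixed column order."""
    return "\t".join(record[name] for name in FIELDS)


record = parse_expxml(EXPXML)
row = to_tsv_row(record)
```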

Because the script relies heavily on calls to the Document Summaries, which are stored on the website's front end, the calls have extremely low overhead to NCBI (for much of the workflow, calls never actually hit the back end at all); for related reasons they also run very quickly and require very little CPU time locally. Finally, because file dumps occur every ~160K records, not enough data are held in memory at once to tax a standard laptop.
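The periodic file dump can be sketched as a small buffered writer. This is a hypothetical illustration rather than the script's implementation: the class name, file naming scheme, and default chunk size are assumptions chosen to mirror the ~160K figure above.

```python
import csv
import os
import tempfile


class ChunkedTsvWriter:
    """Buffer rows and write a new TSV file each time the buffer fills."""

    def __init__(self, out_dir, chunk_size=160_000, prefix="metadata"):
        self.out_dir = out_dir
        self.chunk_size = chunk_size
        self.prefix = prefix
        self.buffer = []
        self.paths = []

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.chunk_size:
            self.flush()

    def flush(self):
        """Write buffered rows to the next numbered file and empty the buffer."""
        if not self.buffer:
            return
        name = f"{self.prefix}_{len(self.paths):05d}.tsv"
        path = os.path.join(self.out_dir, name)
        with open(path, "w", newline="") as handle:
            csv.writer(handle, delimiter="\t").writerows(self.buffer)
        self.paths.append(path)
        self.buffer.clear()


# Usage with a tiny chunk size so the behaviour is visible.
with tempfile.TemporaryDirectory() as out_dir:
    writer = ChunkedTsvWriter(out_dir, chunk_size=2)
    for i in range(5):
        writer.add([f"SRR{i:07d}", "PRJNA000001"])
    writer.flush()  # write the final partial chunk
    n_files = len(writer.paths)
```

Peak memory is then bounded by the chunk size rather than by the total number of records retrieved.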

In terms of more quantitative benchmarks, initial testing of this workflow pulled down all metadata relating to H. sapiens and M. musculus from SRA, BioSample, and BioProject (~3.23M records, or about 65 GB of raw metadata) in 25 minutes.