Hi Biostars,
I need to make some fairly extensive calls for RNA-Seq records in SRA, then annotate them with metadata, and I'd like to do this the most efficient way. I'll provide some relevant background information first, then try to ask a specific question relating to strategic query design...
Background:
At present, I am most interested in project-level metadata. For each SRA record, I would like to obtain:
- the BioProject accession
- the BioSample accession
- the Center ID (e.g. GSE___ or what have you)
- the name, title and description of the project
- the organism name, and the organism strain if relevant
- sequencing info, including the technology type (Illumina, IonTorrent), the sequencer type (HiSeq 2500, NovaSeq 6000, etc.), and library info (single-end or paired-end, and what is targeted, e.g. the poly-A tail)
To do this efficiently without wasting NCBI's resources, I have been studying online documentation of NCBI's API as well as the (extensive!) capabilities of esearch, elink, and efetch. I have also looked into NCBI's API query rate and max record limits (https://www.ncbi.nlm.nih.gov/books/NBK25497/, "Minimizing Number of Requests"), as well as recommendations for implementation (https://www.ncbi.nlm.nih.gov/books/NBK25498, "Application 3: Retrieving Large Datasets").
I do NOT have questions about: getting an API key, batch queries, or running queries concurrently without exceeding API call thresholds - I feel OK about implementing that.
Question:
At this time, I have 2 working query formats, but I am having trouble evaluating which is likely to be more performant for a large number of records. Part of this is due to differences in XML/JSON/TXT output (I am not sure when you can receive each, to be honest), and part relates to a more general conceptual question: which of the approaches below is likely to need fewer API calls and retrieve the data faster?
Option A - Starting with SRA: Querying SRA directly seems intuitive because RNA-seq samples are stored in this database. SRA entries are linked to BioSample and BioProject databases - from which metadata can be obtained. I think this approach could be more efficient if the SRA database's schema readily provides links to the specific BioProject and BioSample IDs needed, along with the other associated metadata.
Option B - Starting with BioProject: Given that all the information sought ties back to BioProjects, initiating queries with BioProject IDs could streamline the process of gathering associated samples and metadata. BioProjects are designed to organize research data across various databases, including SRA, making them a logical starting point for a structured query plan. However, this approach may involve additional steps to then retrieve specific SRA records linked to each BioProject.
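To make the comparison concrete, here is a rough sketch of the request sequence each option would need per batch. It only builds E-utilities URLs and sends nothing. The helper names (`eutils_url`, `option_a_urls`, `option_b_urls`), the batch size, and the `WEBENV` / `1` placeholders are my own assumptions for illustration; in a real run, `WebEnv` and `query_key` would come from the preceding response.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(tool, **params):
    """Build an E-utilities URL for the given tool; no request is sent."""
    return f"{EUTILS}/{tool}.fcgi?{urlencode(params)}"


def option_a_urls(term, webenv="WEBENV", query_key="1", retstart=0, retmax=500):
    """Option A: search SRA directly, then fetch from the stored result set."""
    # Step 1: search SRA and keep the hits on the History server.
    search = eutils_url("esearch", db="sra", term=term, usehistory="y")
    # Step 2: fetch one batch of SRA records from that stored set.
    fetch = eutils_url("efetch", db="sra", WebEnv=webenv, query_key=query_key,
                       retstart=retstart, retmax=retmax)
    return [search, fetch]


def option_b_urls(term, webenv="WEBENV", query_key="1", retstart=0, retmax=500):
    """Option B: search BioProject, link to SRA, then fetch SRA records."""
    # Step 1: search BioProject and keep the hits on the History server.
    search = eutils_url("esearch", db="bioproject", term=term, usehistory="y")
    # Step 2: follow BioProject -> SRA links, storing the linked set as well.
    link = eutils_url("elink", dbfrom="bioproject", db="sra", WebEnv=webenv,
                      query_key=query_key, cmd="neighbor_history")
    # Step 3: fetch one batch of the linked SRA records
    # (the query_key here would be the one returned by elink).
    fetch = eutils_url("efetch", db="sra", WebEnv=webenv, query_key=query_key,
                       retstart=retstart, retmax=retmax)
    return [search, link, fetch]


if __name__ == "__main__":
    for url in option_a_urls("example search term"):
        print(url)
```

Under these assumptions Option B costs one extra round trip (the elink step) before any SRA record comes back, so it would only win if the BioProject search narrows the set much more than an SRA search can.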
Why not just solve this by looking at API documentation?
Generally, in such cases, I would consult API documentation to understand the structure of responses and the relationships between databases, then identify the most efficient querying strategy. There are a few blockers, though.
- One of the major problems here is I don't know to what extent NCBI enables returning these data as XML, JSON, or TXT (under what circumstances and why).
- Aside from this, while starting with SRA seems more direct, option B (BioProject) could necessitate fewer API calls overall but then take more traversal time to drill down to individual sample metadata in a batched query ...
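For what it's worth, here is a minimal sketch of how I would flatten an SRA XML response into the fields listed above, if the response follows the EXPERIMENT_PACKAGE layout I believe efetch returns for db=sra. The sample XML is hand-written and heavily abbreviated with placeholder accessions (it is not a real record), and the function names and output keys are my own.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated placeholder; NOT a real SRA record.
SAMPLE_XML = """<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <EXPERIMENT accession="SRX0000000">
      <TITLE>placeholder title</TITLE>
      <DESIGN>
        <LIBRARY_DESCRIPTOR>
          <LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY>
          <LIBRARY_SELECTION>PolyA</LIBRARY_SELECTION>
          <LIBRARY_LAYOUT><PAIRED/></LIBRARY_LAYOUT>
        </LIBRARY_DESCRIPTOR>
      </DESIGN>
      <PLATFORM>
        <ILLUMINA><INSTRUMENT_MODEL>Illumina HiSeq 2500</INSTRUMENT_MODEL></ILLUMINA>
      </PLATFORM>
    </EXPERIMENT>
    <STUDY>
      <IDENTIFIERS><EXTERNAL_ID namespace="BioProject">PRJNA000000</EXTERNAL_ID></IDENTIFIERS>
    </STUDY>
    <SAMPLE>
      <IDENTIFIERS><EXTERNAL_ID namespace="BioSample">SAMN00000000</EXTERNAL_ID></IDENTIFIERS>
      <SAMPLE_NAME><SCIENTIFIC_NAME>Homo sapiens</SCIENTIFIC_NAME></SAMPLE_NAME>
    </SAMPLE>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>"""


def text_of(node, path):
    """Return the text at `path` under `node`, or None if it is absent."""
    found = node.find(path)
    return found.text if found is not None else None


def first_child_tag(node):
    """Return the tag of the first child element, or None."""
    return node[0].tag if node is not None and len(node) else None


def parse_packages(xml_text):
    """Flatten each EXPERIMENT_PACKAGE into one dict of metadata fields."""
    rows = []
    for pkg in ET.fromstring(xml_text).iter("EXPERIMENT_PACKAGE"):
        exp = pkg.find("EXPERIMENT")
        lib = "EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR"
        rows.append({
            "experiment": exp.get("accession") if exp is not None else None,
            "title": text_of(pkg, "EXPERIMENT/TITLE"),
            "bioproject": text_of(
                pkg, "STUDY/IDENTIFIERS/EXTERNAL_ID[@namespace='BioProject']"),
            "biosample": text_of(
                pkg, "SAMPLE/IDENTIFIERS/EXTERNAL_ID[@namespace='BioSample']"),
            "organism": text_of(pkg, "SAMPLE/SAMPLE_NAME/SCIENTIFIC_NAME"),
            # The platform element's single child names the technology.
            "technology": first_child_tag(pkg.find("EXPERIMENT/PLATFORM")),
            "instrument": text_of(pkg, "EXPERIMENT/PLATFORM/*/INSTRUMENT_MODEL"),
            "strategy": text_of(pkg, lib + "/LIBRARY_STRATEGY"),
            "selection": text_of(pkg, lib + "/LIBRARY_SELECTION"),
            # The layout element's single child is PAIRED or SINGLE.
            "layout": first_child_tag(pkg.find(lib + "/LIBRARY_LAYOUT")),
        })
    return rows


if __name__ == "__main__":
    print(parse_packages(SAMPLE_XML)[0])
```

If real records do carry the BioProject and BioSample identifiers this way, Option A would need no separate link step for those two fields; missing elements simply come back as None here.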
Any thoughts or recommendations?
hi geno - this is definitely helpful. the thing is though, i don't think all the metadata i need will necessarily be in SRA's metadata; even if it is at present, it may not be for the next set of queries.
this question was mostly about knowing how to structure the API calls.
however this may be of use in figuring out the JSON/XML/TXT issue. Thank you!
VL
You should not be doing any API calls if you want to interrogate the entire SRA for the initial selection. These files should have enough for you to create a list that you can use for the later calls (If you need any additional info).
i'm already mostly done, and i think it's with a minimum of outlay to NCBI. thanks for your input but this approach is fine and is effectively already finished.
VAL
GenoMax, I was just snooping around on the FTP site you linked - it doesn't look like there is a similar flat file for GEO - is that your understanding?
SRA, BioSample and BioProject do have them, but GEO appears to be different (I think dbGaP as well).
Try here: https://ftp.ncbi.nih.gov/geo/datasets/
Example: https://ftp.ncbi.nih.gov/geo/datasets/GDS1nnn/GDS1027/soft/GDS1027.soft.gz
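Once a file like that is downloaded, a small reader is enough to pull out the header metadata. This is a rough sketch, assuming the SOFT convention of `^` entity lines and `!` attribute lines written as `key = value`, with the data table starting at `!dataset_table_begin`. The function name and the sample lines in the usage part are made up for illustration, not taken from GDS1027.

```python
import gzip
import io


def read_soft_header(handle):
    """Collect '^' and '!' key/value lines until the data table starts."""
    meta = {}
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(("^", "!")) and " = " in line:
            key, value = line[1:].split(" = ", 1)
            # Keys can repeat, so keep every value in a list.
            meta.setdefault(key, []).append(value)
        elif line.startswith("!dataset_table_begin"):
            break  # everything after this is the expression table
    return meta


if __name__ == "__main__":
    # Made-up placeholder content standing in for a downloaded .soft.gz file.
    fake = gzip.compress(
        b"^DATASET = GDS0000\n"
        b"!dataset_title = placeholder title\n"
        b"!dataset_table_begin\n"
        b"ID_REF\tVALUE\n"
    )
    with gzip.open(io.BytesIO(fake), "rt") as handle:
        print(read_soft_header(handle))
```

For a real file, `gzip.open("GDS1027.soft.gz", "rt")` in place of the in-memory buffer should work the same way.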