9 months ago
marie.harmel
Hello,
I recently came across an inconsistency between the number of Sequence Read Archive (SRA) datasets reported on the NCBI website and the count obtained through a SQL query on BigQuery.
As of February 2024, the NCBI website displays a total of 27,102,173 available SRA datasets.
However, when running the following SQL query on BigQuery:
SELECT DISTINCT m.acc, m.sample_acc, m.biosample, m.sra_study, m.bioproject
FROM `nih-sra-datastore.sra.metadata` AS m,
`nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS tax
WHERE m.acc = tax.acc AND m.bioproject IS NOT NULL
ORDER BY m.bioproject, m.sra_study, m.biosample, m.sample_acc
I obtain 25,636,505 SRA datasets, which is 1,465,668 fewer (about 5.4%) than the number shown on the website.
I am curious to know if this difference in numbers could be attributed to the timing of updates between the NCBI databases on BigQuery and those accessible directly through the NCBI website.
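The query above also drops any run that has no row in tax_analysis or no bioproject, so part of the gap might come from those filters and not from update timing. Here is a minimal sketch of how this could be checked, using only the tables and columns from the query above. It assumes that acc identifies a run in both tables. The output column names (total_runs, runs_with_tax, and so on) are just labels I chose for illustration.

```sql
-- Diagnostic sketch: count distinct runs at each filtering step of the query above.
-- Assumption: acc identifies a run in both tables.
WITH tax_accs AS (
  -- One row per accession that has at least one taxonomy analysis result
  SELECT DISTINCT acc
  FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis`
)
SELECT
  -- All runs present in the metadata table
  COUNT(DISTINCT m.acc) AS total_runs,
  -- Runs that survive the join to tax_analysis
  COUNT(DISTINCT IF(t.acc IS NOT NULL, m.acc, NULL)) AS runs_with_tax,
  -- Runs that survive the bioproject filter
  COUNT(DISTINCT IF(m.bioproject IS NOT NULL, m.acc, NULL)) AS runs_with_bioproject,
  -- Runs that survive both conditions, as in the query above
  COUNT(DISTINCT IF(t.acc IS NOT NULL AND m.bioproject IS NOT NULL, m.acc, NULL)) AS runs_with_both
FROM `nih-sra-datastore.sra.metadata` AS m
LEFT JOIN tax_accs AS t
  ON m.acc = t.acc
```

If total_runs is close to the website figure while runs_with_both is close to 25,636,505, the filters would account for most of the gap. If total_runs is itself lower than the website figure, update timing or a difference in what the website counts seems the more likely explanation. Note that runs_with_both counts distinct accessions, so it may differ slightly from the row count of my original query if one accession appears with several different sample or study values.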
Thank you in advance for your time and assistance.