Question

News: New NCBI Datasets home and documentation pages provide easier access

5

3.5 years ago

e.cox ▴ 50

NCBI Datasets, the new set of services for downloading genome assembly and annotation data, has redesigned and reorganized web pages to make it easier to find and access the services and documentation you need.

NCBI Datasets has a fresh new homepage highlighting the types of data available through our tools. Available data include genome assemblies, genes, and SARS-CoV-2 genomic and protein data. You can easily access these from the new page or learn more with our new documentation pages.

Our new NCBI Datasets documentation will help you get answers faster. If you are new to Datasets, try our Quickstarts to get started with our web pages and tools. How-tos describe common workflows and data requests and provide multiple solutions: our web pages, command-line tools, and Python and R packages.

For example, if you need to download human genome data, including sequence, annotation and metadata, see the Download genome data How-to guide to get data using the Genomes web page, the datasets command-line tool, Python, and R.

See the full blog post on NCBI Insights.

ncbi gene SARS-CoV-2 genome • 2.0k views

updated 3.5 years ago by Istvan Albert 101k • written 3.5 years ago by e.cox ▴ 50

score 2 · Answer 1 · 2021-04-21

I have been trying to use datasets in education. Conceptually it is an excellent idea, but the implementation and interfaces are inefficient, limited, and extremely tedious.

It promotes an awkward, verbose, free-text parameter scheme, fundamentally unlike how command-line tools are supposed to work!

It is as if the implementers never used the UNIX command line themselves. It reminds me of the incredible awkwardness of GATK, which in the end also had to be rewritten to use proper Unix parameter format (by version 4). For example, take this command:

datasets summary genome accession GCF_000001405.39

The above formalism does not match the way scientists and bioinformaticians use command-line tools.

In the world of data integration, why can't a tool figure out from GCF_000001405.39 that it is a genome accession? I know this without even having to do a search. Why is NCBI datasets unable to figure out what it is unless someone explicitly types "genome accession"?
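To illustrate the point, here is a minimal sketch of how a tool could infer the data type from the identifier alone. It assumes the usual assembly accession layout (GCF_ or GCA_, nine digits, a dot, and a version number); the function name `accession_type` is hypothetical and is not part of datasets.

```python
import re

# Assumed layout of a genome assembly accession:
# GCF_ (RefSeq) or GCA_ (GenBank), nine digits, a dot, then a version.
ASSEMBLY_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def accession_type(accession: str) -> str:
    """Guess what kind of record an accession refers to (hypothetical helper)."""
    if ASSEMBLY_RE.match(accession):
        return "genome accession"
    # Other accession families are deliberately left out of this sketch.
    return "unknown"


if __name__ == "__main__":
    print(accession_type("GCF_000001405.39"))
```

Only the genome assembly case is covered here; other record types would need their own patterns.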

Here is what the tool should look like:

datasets GCF_000001405.39 --summary

or if I want a fasta file I should do:

 datasets GCF_000001405.39 --fasta

And so on. If I wanted the metadata for the accession number in CSV format, this is how it should work:

 datasets GCF_000001405.39 --metadata --format csv
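As a rough sketch of what such an interface could look like, the wrapper below parses the flag-style form proposed above and translates it into the verbose form quoted earlier. Everything in it is hypothetical: the flags `--summary`, `--fasta`, `--metadata`, and `--format` are the ones suggested in this post, not options that datasets accepts, and only the summary case is mapped because it is the only current command shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for the proposed (hypothetical) flag-style interface."""
    parser = argparse.ArgumentParser(prog="datasets")
    parser.add_argument("accession")
    parser.add_argument("--summary", action="store_true")
    parser.add_argument("--fasta", action="store_true")
    parser.add_argument("--metadata", action="store_true")
    parser.add_argument("--format", choices=["json", "csv"], default="json")
    return parser


def to_current_command(argv: list[str]) -> list[str]:
    """Translate the proposed syntax into the verbose form quoted above."""
    args = build_parser().parse_args(argv)
    # Assumption: GCF_/GCA_ prefixes mark genome assembly accessions.
    if not args.accession.startswith(("GCF_", "GCA_")):
        raise ValueError(f"unrecognized accession: {args.accession}")
    if args.summary:
        return ["datasets", "summary", "genome", "accession", args.accession]
    # The --fasta and --metadata mappings are intentionally left unfilled.
    raise NotImplementedError("only --summary is mapped in this sketch")


if __name__ == "__main__":
    print(" ".join(to_current_command(["GCF_000001405.39", "--summary"])))
```

The returned list could be handed to `subprocess.run` if the datasets binary is installed; the sketch itself never calls it.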

In the current implementation, every invocation of the datasets tool downloads gigantic gzipped blob files that, when unpacked, contain files and directories all named the same way (ncbi_data, prots.fa, etc.): a nightmare in data management. The same "blob" mentality that permeates SRA is now in datasets.

To extract the metadata from the "blob", I need to understand, download, and run another tool, dataformats, that operates on some obtusely named file such as data_table.json (or whatever)?

I could go on, but suffice it to say I consider datasets an excellent idea being ruined by a terrible interface and design.

I look at datasets and I see a tool designed by "programmers" who work by the hour to implement "features" listed on a whiteboard. It is not a tool designed by scientists trying to perform a scientific analysis.