News: New NCBI Datasets home and documentation pages provide easier access
3.0 years ago
e.cox ▴ 50

NCBI Datasets, the new set of services for downloading genome assembly and annotation data, has redesigned and reorganized web pages to make it easier to find and access the services and documentation you need.

NCBI Datasets has a fresh new homepage highlighting the types of data available through our tools. Available data include genome assemblies, genes, and SARS-CoV-2 genomic and protein data. You can easily access these from the new page or learn more with our new documentation pages.

Our new NCBI Datasets documentation will help you get answers faster. If you are new to Datasets, try our Quickstarts to get started with our web pages and tools. How-tos describe common workflows and data requests and provide multiple solutions: our web pages, command-line tools, and Python and R packages.

For example, if you need to download human genome data, including sequence, annotation and metadata, see the Download genome data How-to guide to get data using the Genomes web page, the datasets command-line tool, Python, or R.

See the full blog post on NCBI Insights.

ncbi gene SARS-CoV-2 genome • 1.9k views
3.0 years ago

I have been trying to use datasets in education. Conceptually it is an excellent idea, but the implementation and interfaces are inefficient, limited, and tedious in the extreme.

It promotes an awkward, verbose, free-text parameter scheme, fundamentally unlike how command-line tools are supposed to work!

It is as if the implementers never used the UNIX command line themselves. It reminds me of the incredible awkwardness of GATK, which in the end also had to be rewritten to use a proper Unix parameter format (by version 4). For example, take this command:

datasets summary genome accession GCF_000001405.39

The above formalism does not match the way scientists and bioinformaticians use command-line tools.

In the world of data integration, why can't a tool figure out from GCF_000001405.39 that it is a genome accession? I know this even without having to do a search. Why is NCBI datasets unable to figure out what it is without someone explicitly typing "genome accession"?
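To make the point concrete, here is a minimal Python sketch of prefix-based accession detection. It assumes only that GCF_ and GCA_ prefixes mark RefSeq and GenBank genome assemblies and that NM_/NR_/XM_/XR_ and NP_/XP_/YP_/WP_ mark RefSeq transcripts and proteins. The function name `guess_accession_type` and the pattern table are hypothetical and not part of any NCBI tool, and the list is far from exhaustive.

```python
import re

# Illustrative, non-exhaustive table of accession patterns (an assumption of
# this sketch, not an official NCBI specification).
ACCESSION_PATTERNS = [
    ("genome assembly", re.compile(r"GC[AF]_\d{9}(\.\d+)?")),
    ("RefSeq transcript", re.compile(r"[NX][MR]_\d+(\.\d+)?")),
    ("RefSeq protein", re.compile(r"[NXYW]P_\d+(\.\d+)?")),
]


def guess_accession_type(accession: str) -> str:
    """Return a best-guess label for an accession, or 'unknown'."""
    candidate = accession.strip()
    for label, pattern in ACCESSION_PATTERNS:
        # fullmatch requires the whole string to fit the pattern
        if pattern.fullmatch(candidate):
            return label
    return "unknown"


if __name__ == "__main__":
    print(guess_accession_type("GCF_000001405.39"))
```

A tool built this way would only need the explicit "genome accession" words as a fallback for identifiers it cannot classify.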

Here is what the tool should look like:

datasets GCF_000001405.39 --summary

or, if I want a FASTA file, I should be able to run:

 datasets GCF_000001405.39 --fasta 

and so on. If I wanted the metadata for the accession number in CSV format, this is how it should work:

 datasets GCF_000001405.39 --metadata --format csv 
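As a rough illustration of the flag-based interface proposed above, here is a sketch using Python's standard argparse module. Everything in it is hypothetical: the flags `--summary`, `--fasta`, `--metadata`, and `--format` are the ones suggested in this post, not options of the real datasets tool, and the parser does nothing beyond parsing.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the hypothetical interface sketched in this post."""
    parser = argparse.ArgumentParser(
        prog="datasets",
        description="Hypothetical flag-based interface (not the real NCBI tool).",
    )
    # The accession is the only positional argument.
    parser.add_argument("accession", help="accession, e.g. GCF_000001405.39")
    # Exactly one action must be requested per invocation.
    action = parser.add_mutually_exclusive_group(required=True)
    action.add_argument("--summary", action="store_true", help="print a summary")
    action.add_argument("--fasta", action="store_true", help="emit sequence as FASTA")
    action.add_argument("--metadata", action="store_true", help="emit metadata")
    parser.add_argument(
        "--format",
        choices=["json", "csv", "tsv"],
        default="json",
        help="output format for tabular data",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Running it with `--help` would produce the self-documenting usage text that conventional command-line tools provide.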

In the current implementation, every invocation of the datasets tool downloads gigantic gzipped blob files that, when unpacked, contain files and directories all named the same way (ncbi_data, prots.fa, etc.): a nightmare for data management. The same "blob" mentality that permeates SRA is now in datasets.

To extract the metadata from the "blob", I need to understand, download, and run another tool, dataformat, that operates on some obscurely named file, data_table.json (or whatever)?

I could go on, but suffice it to say I consider datasets an excellent idea being ruined by a terrible interface and design.

I look at datasets and I see a tool designed by "programmers" who work by the hour to implement "features" listed on a whiteboard. It is not a tool designed by scientists trying to perform a scientific analysis.


Hi Istvan,

Thanks for your feedback.

The datasets command-line tool is a work in progress and we will carefully consider your comments as we continue to develop the tool.

One of the main goals of the NCBI Datasets project is to get feedback from the community that will help us improve our tools.

We have interviewed dozens of users throughout the course of development and many of our design decisions have been informed by the feedback that we have received.

We continue to welcome all feedback and we're happy to see the community discuss our tools on Biostars.

We also encourage users to contact us directly with any suggestions or questions by email at info@ncbi.nlm.nih.gov

Thanks, NCBI Datasets Team


My apologies if I came across as a little antagonistic. I feel frustrated seeing a good idea take a wrong turn.

I will say this is not about interviewing a few people with various backgrounds. As NCBI you are building a tool for the entire world, and its design should not rest on local opinions. It is about following the standards for command-line interfaces. There is no reason to invent "new" methods, especially not free-text-like interfaces. Instead of interviewing people, I would recommend looking at how most bioinformatics tools work. bwa and bedtools each have subcommands, each is self-documenting, and each has well-defined named parameters rather than positional words. For example, type:

 bwa mem 

and it tells you how it works. There is a reason these tools look like that: the conventions were honed through long use.

I do recognize that designing APIs is very hard, especially when it comes to such a gigantic data repository that you already have. But I strongly urge you to re-evaluate what you are doing now. You are designing for local minima, instead of a simple, logical and coherent data model.

Take, for example, your SARS-CoV-2 viral package. Here is how it works:

datasets download virus genome taxon SARS2 --host human --complete-only

The command above will download a blob file. How is that any better than rsync-ing a prebuilt file like so?

rsync -avz https://www.ncbi.com/file.tar.gz

It is not! Not only is it no better, your method is inferior to rsync. rsync can do differential transfer even on single files: if nothing has changed, or just one file was added to the archive, it transfers only the difference.

When using datasets we have to download the same blob file over and over again. Gigabytes of unnecessary transfer take place each time I want to get the most up-to-date information. Even if just one more genome is added, we have to download the ever-growing dataset again ... I see this as an impossible race.

datasets should be a tool that tells us where the file that we need is, not a tool that actually downloads it. There are countless efficient ways to transfer large files of various kinds; the bottleneck is that we don't know what to download.
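To sketch what "tell us what to download" could look like, here is a minimal Python example that compares a manifest of remote files against a local directory and reports only the files that are missing or changed. The manifest format (file name mapped to MD5 checksum) and the function names are assumptions made for illustration; nothing here reflects an existing NCBI service. Unlike rsync, this works only at whole-file granularity.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Compute the MD5 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_fetch(manifest: dict[str, str], local_dir: Path) -> list[str]:
    """Return names from the manifest that are missing or differ locally.

    `manifest` maps a file name to its expected MD5 checksum (a hypothetical
    format assumed for this sketch).
    """
    needed = []
    for name, checksum in sorted(manifest.items()):
        local = local_dir / name
        # Fetch when the file is absent or its content no longer matches.
        if not local.is_file() or md5_of(local) != checksum:
            needed.append(name)
    return needed
```

The resulting list could then be handed to whatever transfer tool the user prefers, so adding one genome would cost one file's worth of transfer rather than a full re-download.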
