How to get sample details from NCBI datasets
23 months ago
beginner123 ▴ 30

Hi, I have a list of assembly accession numbers like this: GCF_000421605.1, GCF_001652585.1, GCF_012317585.1, GCF_011207455.1

How can I automatically get the sample details of these assemblies as a list?


Assembly Datasets Sample NCBI Accession details • 1.5k views
23 months ago
vkkodali_ncbi ★ 3.7k

You can do this with the NCBI Datasets command-line tool. First, put all the assembly accessions in a file, one per line, and use datasets to download a package (the --dehydrated flag fetches only metadata and skips all sequence and annotation data). Then use the dataformat tool to convert the assembly data report from JSON Lines to a tabular format.

## input file
$ cat accs.txt 
GCF_000421605.1
GCF_001652585.1
GCF_012317585.1
GCF_011207455.1
## download using datasets 
$ datasets download genome accession --inputfile accs.txt --dehydrated
## contents of the datasets package 
$ unzip -v ncbi_dataset.zip 
Archive:  ncbi_dataset.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
    1604  Defl:N      769  52% 2022-06-04 08:47 3de26d82  README.md
   12987  Defl:N     3514  73% 2022-06-04 08:47 53d1bc19  ncbi_dataset/data/assembly_data_report.jsonl
    4497  Defl:N      782  83% 2022-06-04 08:47 af0e24e7  ncbi_dataset/fetch.txt
    2522  Defl:N      385  85% 2022-06-04 08:47 38db1709  ncbi_dataset/data/dataset_catalog.json
--------          -------  ---                            -------
   21610             5450  75%                            4 files
## extract metadata into a table using dataformat 
$ dataformat tsv genome --package ncbi_dataset.zip > assm_tbl.txt

Note that the output table assm_tbl.txt is quite large, with 95 fields, but it should be feasible to load it into a spreadsheet app and filter the data as needed. Alternatively, if you are interested in only a specific set of BioSample attributes, you can use dataformat to extract only those. Finally, if you are comfortable with JSON, you can use a tool like jq (https://stedolan.github.io/jq/) to conditionally extract only certain fields.
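If jq is not at hand, the same kind of conditional extraction can be sketched with the Python standard library. This is only a rough sketch, not part of the Datasets tooling. It assumes each line of assembly_data_report.jsonl is a JSON object with a top-level accession key and BioSample attributes stored under assemblyInfo.biosample.attributes as a list of name/value pairs; that layout may differ between datasets versions, so check one line of your own report first. The function names and the attribute names used below (geo_loc_name, isolation_source, collection_date) are illustrative choices, not something the package guarantees.

```python
import io
import json
import zipfile

# Path of the report inside the package, as listed by `unzip -v` above.
REPORT_PATH = "ncbi_dataset/data/assembly_data_report.jsonl"


def read_report(zip_path):
    """Yield the lines of the assembly data report stored in the zip package."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(REPORT_PATH) as fh:
            yield from io.TextIOWrapper(fh, encoding="utf-8")


def biosample_rows(lines, wanted):
    """Yield one dict per assembly: accession, BioSample accession, and
    the requested BioSample attributes (empty string when missing)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # ASSUMPTION: attributes live under assemblyInfo.biosample.attributes
        # as a list of {"name": ..., "value": ...} objects.
        biosample = (record.get("assemblyInfo") or {}).get("biosample") or {}
        attrs = {a.get("name"): a.get("value", "")
                 for a in biosample.get("attributes", [])}
        row = {"accession": record.get("accession", ""),
               "biosample": biosample.get("accession", "")}
        for name in wanted:
            row[name] = attrs.get(name, "")
        yield row


def print_table(zip_path, wanted):
    """Print a tab-separated table with a header line to standard output."""
    print("\t".join(["accession", "biosample"] + list(wanted)))
    for row in biosample_rows(read_report(zip_path), wanted):
        print("\t".join(str(value) for value in row.values()))
```

For example, calling print_table("ncbi_dataset.zip", ["geo_loc_name", "isolation_source", "collection_date"]) in the directory holding the downloaded package would print one row per assembly, leaving a column empty wherever a sample does not carry that attribute.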
