Metadata for assemblies deposited in NCBI
14 months ago
blur ▴ 280

Hi,

I have downloaded 700+ assemblies from NCBI for a phylogenetic tree. The submitter listed for most of them is NCBI, which I doubt is the original submitter: in several cases, the identifiers marked 'NCBI' match assemblies from published papers that did not come from NCBI.

I think the right data would be under the BioSample information. How can I batch download that data and match it to the assembly accession number?

Is there a way to get all the metadata for these assemblies? So far, the only metadata I could find in NCBI is the submitter...

Any help would be greatly appreciated.

ncbi metadata assembly • 845 views

Can you provide some example assembly accessions?

I have an example here that may be useful: How to download sample attributes (sample metadata) file from the European nucleotide archive (EMBL-EBI)?

14 months ago
MirianT_NCBI ▴ 720

Hi,

You can use NCBI Datasets to retrieve metadata from those assemblies. You have multiple ways of achieving what you need. For simplicity, I'll post two options:

  1. You can download the entire metadata report and process it using either dataformat or jq. I'm assuming you have a text file with all the assemblies you need metadata for, one per line.
datasets summary genome accession --inputfile assemblies.txt --as-json-lines > assemblies-metadata.jsonl
dataformat tsv genome --inputfile assemblies-metadata.jsonl --fields accession,assminfo-submitter

This will print a tab-separated table with the assembly accession and submitter. Redirect the output of the second command to a file if you want to save it as a TSV.

  2. Instead of saving the entire metadata file, you can pipe the first command into the second and extract only the information you need.
datasets summary genome accession --inputfile assemblies.txt --as-json-lines | dataformat tsv genome --fields accession,assminfo-submitter

For example:

cat assemblies.txt
GCF_000002285.5
GCF_005444595.1
GCF_011100685.1

datasets summary genome accession --inputfile assemblies.txt --as-json-lines | dataformat tsv genome --fields accession,assminfo-submitter

Assembly Accession      Assembly Submitter
GCF_000002285.5 Dog Genome Sequencing Consortium
GCF_005444595.1 University of Michigan
GCF_011100685.1 Uppsala University
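
With a list of 700+ accessions, it may be worth checking that every requested assembly actually came back in the table, and seeing how the submitters are distributed. Below is a minimal sketch, assuming the table above was saved as submitters.tsv (a hypothetical filename, not something datasets creates for you). The sample rows are copied from the example so the snippet runs without network access.

```shell
# Recreate the example inputs from above so the sketch runs offline.
# submitters.tsv is a hypothetical name for the saved dataformat output.
printf '%s\n' GCF_000002285.5 GCF_005444595.1 GCF_011100685.1 > assemblies.txt
{
  printf 'Assembly Accession\tAssembly Submitter\n'
  printf 'GCF_000002285.5\tDog Genome Sequencing Consortium\n'
  printf 'GCF_005444595.1\tUniversity of Michigan\n'
  printf 'GCF_011100685.1\tUppsala University\n'
} > submitters.tsv

# Requested accessions that have no row in the report (empty if all were found).
# The first file fills the lookup table (skipping its header); the second is filtered.
missing=$(awk -F'\t' 'NR == FNR { if (FNR > 1) seen[$1] = 1; next } !($1 in seen)' submitters.tsv assemblies.txt)

# Number of assemblies per submitter, highest count first, ties ordered by name.
tally=$(awk -F'\t' 'FNR > 1 { n[$2]++ } END { for (s in n) printf "%d\t%s\n", n[s], s }' submitters.tsv | sort -t "$(printf '\t')" -k1,1nr -k2,2)

printf 'Missing accessions: %s\n' "${missing:-none}"
printf '%s\n' "$tally"
```

Any accession reported as missing is a candidate for a typo or a withdrawn record, and a submitter that dominates the tally is easy to spot for a closer look.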

I hope it helps!
