Question: What is the unique identifier for an organism in an NCBI gbff file?
15 months ago, Nari (United States) wrote:

Dear all, I wish you a happy new year 2018.

I am trying to parse NCBI gbff files for genomes. I am interested in the coding sequences of one organism at a time, irrespective of plasmids or multiple chromosomes, and I want a single standard identifier in the CDS FASTA headers. Because the RefSeq accession differs between the chromosomes and plasmids of the same organism, I cannot use it as the identifier for the organism. (A gbff file now contains everything, all chromosomes and plasmids, in one file.)
I have three candidate IDs for one bacterium, Escherichia coli strain UCD_JA03:
BioProject: PRJNA224116
Assembly: GCF_000599725.1
BioSample: SAMN02650859
Which identifier is best to move forward with, i.e. one that is shared by the chromosomes and plasmids of the same strain and has no chance of being redundant across bacterial strains?
(Although I could generate an identifier programmatically, I want to stick to a standard identifier for clarity.)


How about the NCBI taxonomy id?

written 15 months ago by Michael Dondrup
15 months ago, Michael Dondrup (Bergen, Norway) wrote:

Astonishingly, none of the above ;) I would use the Assembly accession, but note that there is an updated version, GCF_000599725.2. You can then also put the species and strain into the FASTA header as a further reference.

Escherichia coli strain UCD_JA03

Not found in the NCBI taxonomy.

BioProject: PRJNA224116

Is a multi-species project and won't help to distinguish species or strains.



Assembly: GCF_000599725.1

Most unique and specific, but there is an update (GCF_000599725.2).

BioSample: SAMN02650859

Does not indicate the assembly version
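
A rough sketch of how the Assembly accession could be pulled out of a gbff file: RefSeq records typically list BioProject, BioSample and Assembly in the DBLINK block near the top of each record, so it can be read once per record and checked to be identical for the chromosome and every plasmid. This is standard-library Python, not a full GenBank parser; it assumes the usual flat-file layout with keywords in the first 12 columns. The function name `dblink_ids` and the record accessions in the usage lines are made up for illustration.

```python
def dblink_ids(gbff_text):
    """Map each record's VERSION string to its DBLINK entries.

    Assumes standard GenBank flat-file layout: top-level keywords start in
    the first 12 columns and continuation lines are indented by 12 spaces.
    """
    records = {}
    for chunk in gbff_text.split("\n//"):        # "//" terminates a record
        version, links, in_dblink = None, {}, False
        for line in chunk.splitlines():
            keyword = line[:12].strip()
            if keyword == "VERSION":
                parts = line.split()
                if len(parts) > 1:
                    version = parts[1]
            if keyword:                          # any new keyword ends the DBLINK block
                in_dblink = keyword == "DBLINK"
            if in_dblink and ":" in line:
                name, value = line[12:].split(":", 1)
                links[name.strip()] = value.strip()
        if version:
            records[version] = links
    return records


# Usage with two made-up records (hypothetical accessions) sharing one assembly.
def _record(version):
    rows = [("LOCUS", version), ("VERSION", version),
            ("DBLINK", "BioProject: PRJNA224116"),
            ("", "BioSample: SAMN02650859"),
            ("", "Assembly: GCF_000599725.1"),
            ("KEYWORDS", "RefSeq.")]
    return "\n".join(f"{key:<12}{value}" for key, value in rows) + "\n//\n"

sample = _record("NZ_EXAMPLE01.1") + _record("NZ_EXAMPLE02.1")
ids = dblink_ids(sample)
assemblies = {links.get("Assembly") for links in ids.values()}
print(assemblies)  # a single element means one shared ID for all replicons
```

If the set holds more than one value, the file mixes assemblies and should not be given a single strain identifier.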


Using assembly accession numbers may be the best option, though they do not refer to single sequence records. The NCBI Assembly database, which provides stable accessioning and data tracking for genome assembly data, points back to that number. There are also some other IDs that may be searchable through E-utilities (IDs: 569591 [UID], 2551208 [GenBank], 2599488 [RefSeq]).
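
For the E-utilities lookup, a minimal sketch of building an ESearch request against the `assembly` database is shown here. It only constructs the URL and makes no network call; the helper name `assembly_search_url` is hypothetical, and searching on the bare accession as the term is an assumption rather than the only possible query.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def assembly_search_url(accession):
    """Build an ESearch URL that looks up an accession in the assembly database."""
    query = {"db": "assembly", "term": accession, "retmode": "json"}
    return ESEARCH + "?" + urlencode(query)

url = assembly_search_url("GCF_000599725.1")
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return the matching UIDs, which can then be passed on to ESummary.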

written 15 months ago by genomax

Thanks for your insights. I just want an identifier that is unique to one gbff file (covering all of its chromosomes and plasmids) and is not shared with a different strain of the organism. Different assembly versions of the same strain may not be a problem, as I will find orthologous clusters. What would you suggest, considering this?
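
One way this requirement could look in practice is a CDS FASTA header that leads with the assembly accession, followed by the replicon accession and the locus tag. The sketch below is only an illustration: the `|`-separated layout, the function names, and the replicon accessions and locus tags in the usage lines are all made up, and the `keep_version` switch reflects the point that the assembly version may not matter here.

```python
def cds_header(assembly, replicon, locus_tag, organism="", keep_version=True):
    """Compose a FASTA header whose first field identifies the strain."""
    strain_id = assembly if keep_version else assembly.split(".")[0]
    header = ">" + "|".join([strain_id, replicon, locus_tag])
    return header + (" " + organism if organism else "")

def strain_id_of(header):
    """Recover the strain-level identifier from a header made by cds_header."""
    return header.lstrip(">").split("|", 1)[0]

# Hypothetical chromosome and plasmid records of the same strain.
organism = "Escherichia coli strain UCD_JA03"
headers = [
    cds_header("GCF_000599725.1", "NZ_EXAMPLE01.1", "TAG_0001", organism),
    cds_header("GCF_000599725.1", "NZ_EXAMPLE02.1", "TAG_4501", organism),
]
print(headers[0])
print({strain_id_of(h) for h in headers})  # one ID shared by both replicons
```

Each header stays unique per CDS through the locus tag, while the first field is the same for every sequence from that gbff file.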

written 15 months ago by Nari