Question: What Is The Best Gene Identifier System To Use For A Bacterial Genome?
Michael Barton (1.8k), Akron, Ohio, United States, wrote 8.0 years ago:

I'm preparing to submit a draft bacterial genome to GenBank. The submission specifications require each gene have a unique identifier. What is a good identifier system to use? Start at 1 then use increasing numbers clockwise from the origin of replication?

This system seems fragile though. For instance, how should genes discovered after submission be identified? What if you wish to reassemble and reannotate the genome in light of new sequencing data?

Are there any interesting alternative systems? E.g. using the digest of the gene sequence.

Lyco (2.3k) wrote 8.0 years ago:

I work a lot with bacterial genome data from a lot of places. What most people use is something along the lines of "MB12345.1" (assuming that you have sequenced Miraculobacillus Bartonii). This is 2 letters of species abbreviation, followed by an ORF number (ideally in the order in which the genes are found in the genome), followed by a version number (typically the version of the assembly). What some people do, particularly those in the eukaryotic area, is to leave some room for newly identified genes, e.g. by using MB123450, MB123460...
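As a rough illustration of the "MB12345.1" layout described above, here is a minimal Python sketch. The function names, the 5-digit zero padding, and the regular expression are assumptions made for this example, not part of any GenBank specification.

```python
import re

# Assumed layout for identifiers such as "MB12345.1":
# letters (species abbreviation), digits (ORF number), ".", digits (version).
GENE_ID = re.compile(r"^([A-Za-z]+)(\d+)\.(\d+)$")


def format_gene_id(prefix, orf_number, version, width=5):
    """Build an identifier like MB12345.1 from its three parts."""
    return f"{prefix}{orf_number:0{width}d}.{version}"


def parse_gene_id(gene_id):
    """Split an identifier back into (prefix, ORF number, version)."""
    match = GENE_ID.match(gene_id)
    if match is None:
        raise ValueError(f"not a recognised gene identifier: {gene_id!r}")
    prefix, orf, version = match.groups()
    return prefix, int(orf), int(version)


print(format_gene_id("MB", 12345, 1))  # MB12345.1
print(parse_gene_id("MB00042.2"))      # ('MB', 42, 2)
```

Keeping the three parts separable like this makes it easy to bump only the version when the assembly changes.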

Generally, these identifiers are pleasant to work with. If I had to invent something new, I would probably put the assembly number next to the species, maybe like MB1_12345. However, it is probably advisable to stick to the conventions.

Michael Kuhn (5.0k), EMBL Heidelberg, wrote 8.0 years ago:

In the end, the unique identifiers will only map to one specific version of the genome and its annotation. Through re-sequencing and re-annotation, sequences might change etc., and any additional info that you encode in the ids could be invalidated.

The ids need to be short, so that humans can quickly recognize and distinguish them (and many programs have a gene name length limit). So I just cannot see the advantage of URNs / hashes / digests. You don't need any fancy namespaces or prefixes: it's clear what species this is.

Pierre Lindenbaum (120k), France/Nantes/Institut du Thorax - INSERM UMR1087, wrote 8.0 years ago:

My two cents: I would use a short URN/URI as a unique identifier (e.g. see the now-deprecated LSID) that would include the version number of your assembly, of your contig, and of your annotation. You could use the position in the contig to identify the gene itself.

something like:


a short hash would be another good idea, but people would need to resolve it.
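To make the two ideas above a bit more concrete, here is a small Python sketch of a position-based URN and a short sequence hash. The "urn:gene:" namespace, the field order, the organism label, and the choice of truncated SHA-1 are all assumptions invented for this illustration; they are not an existing standard, and the sketch omits the contig and annotation versions for brevity.

```python
import hashlib


def gene_urn(organism, assembly, contig, start, end, strand):
    """Compose a URN-like identifier from assembly version, contig and position.

    The "urn:gene:" namespace and the field layout are hypothetical.
    """
    return f"urn:gene:{organism}:asm{assembly}:{contig}:{start}-{end}:{strand}"


def short_hash(sequence, length=8):
    """Return the first `length` hex characters of the SHA-1 digest of a sequence."""
    normalised = sequence.upper().encode("ascii")  # ignore case differences
    return hashlib.sha1(normalised).hexdigest()[:length]


print(gene_urn("mbartonii", 1, "contig7", 1200, 2450, "+"))
# urn:gene:mbartonii:asm1:contig7:1200-2450:+
print(short_hash("ATGAAATAA"))
```

Note the trade-off: the URN changes whenever coordinates shift after reassembly, while the hash changes whenever the sequence itself is corrected, and a truncated hash can in principle collide.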


hey another idea ! a tweet ID!!! geneid:78892400524791808 :-)


yes! use the tweet ID.

Reply written 8.0 years ago by brentp (23k)
Pasta (1.3k) wrote 8.0 years ago:

In our lab we used the first 3 letters of our bacterium's name, to which we added a 5-digit number (increasing clockwise from the origin of replication) incremented by 10. This gives you space in case you have to add new features to your GBK file.
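A minimal Python sketch of this kind of numbering, assuming a hypothetical prefix "ABC", numbering that starts at 10, and no separator between prefix and number (none of these details are confirmed by the post):

```python
def spaced_gene_ids(prefix, gene_count, step=10, width=5):
    """Number genes in genome order, leaving gaps of `step` between them."""
    return [f"{prefix}{i * step:0{width}d}" for i in range(1, gene_count + 1)]


def id_between(prefix, left, right, width=5):
    """Pick an unused number midway between two neighbouring identifiers."""
    lo = int(left[len(prefix):])
    hi = int(right[len(prefix):])
    if hi - lo < 2:
        raise ValueError("no free number left between these identifiers")
    return f"{prefix}{(lo + hi) // 2:0{width}d}"


ids = spaced_gene_ids("ABC", 3)
print(ids)                                # ['ABC00010', 'ABC00020', 'ABC00030']
print(id_between("ABC", ids[0], ids[1]))  # ABC00015
```

With a step of 10, up to nine new features fit between any two original genes before the scheme runs out of room.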


A bug called "E. coli strain K" would produce:





Nicojo (1.1k), Kyoto, Japan, wrote 8.0 years ago:

Another thing to consider: we are increasingly realizing that each sequenced individual (or clonal population thereof) contains differences. The most widely acknowledged ones are of course SNPs, but indels and copy number polymorphisms are also very common. In addition to those, there are a lot of rearrangements (at least for organisms with linear chromosomes) and horizontal gene transfers from other individuals/populations...

All this tells me that at some point not too far away we will be considering the sequence of each individual within a species independently from the others. We will eventually need to be able to distinguish between strains.

Note that a strain (especially bacterial) cultivated in vitro for hundreds/thousands of generations is likely to have a different genome from that of the original strain. So even with the same name, it might not be the same.

Today, I haven't seen anyone take this into account in their sequencing/annotating/naming efforts. But I do know that it is an issue in some fields of research, like parasitology. I have seen, at a conference, several attendees verbally fighting over conflicting results. In the end they agreed that although they were using the same "strain", it had been cultivated for a long time and each of their aliquots had certainly evolved very differently.

I would suggest including something in the name to specify which isolate you are annotating. For that matter, I think Pierre's suggestion may be a good one (URN or URI).
