Looking for information on designing effective accession numbers.
7.6 years ago
tyler.weirick ▴ 120

I am working on a paper that will feature a database of novel lncRNAs. I have spent several days searching for information about good design practices for accession numbers. Unfortunately, everything seems to contain the word "accession", so I am having difficulty finding material (if it even exists).

1. Does anyone know of design guidelines or methods for making effective accession systems?

2. Does anyone have thoughts on how to effectively encode information into accession numbers? For example, Ensembl includes the organism and sequence type in its accession numbers, e.g. ENSMUSG00000041147 (ENS + organism + sequence type + 00000041147).
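
To make the Ensembl example concrete, here is a minimal Python sketch of how such an encoded accession could be split into its parts. It is an illustration only, not Ensembl's own code: the function name `parse_ensembl_style`, the assumption that the species code is either absent (as in human IDs such as ENSG00000139618) or exactly three letters, and the small feature-type table are all simplifications made for this example.

```python
import re
from typing import NamedTuple, Optional

# Simplified feature-type table (assumption: only these four letters are handled).
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}

# ENS + optional 3-letter species code + feature letter + 11 digits + optional ".version"
_PATTERN = re.compile(
    r"^ENS(?P<species>[A-Z]{3})?(?P<feature>[GTPE])(?P<number>\d{11})"
    r"(?:\.(?P<version>\d+))?$"
)


class EnsemblStyleId(NamedTuple):
    species: Optional[str]   # None when no species code is present
    feature: str
    number: int
    version: Optional[int]   # None when no ".N" suffix is present


def parse_ensembl_style(accession: str) -> EnsemblStyleId:
    """Split an Ensembl-style accession into its encoded fields."""
    match = _PATTERN.match(accession)
    if match is None:
        raise ValueError(f"not an Ensembl-style accession: {accession!r}")
    version = match.group("version")
    return EnsemblStyleId(
        species=match.group("species"),
        feature=FEATURE_TYPES[match.group("feature")],
        number=int(match.group("number")),
        version=int(version) if version is not None else None,
    )


parsed = parse_ensembl_style("ENSMUSG00000041147")
print(parsed.species, parsed.feature, parsed.number)
```

Every consumer of the accession needs a parser like this, and the parser has to change whenever the encoding does, which is part of what the question is really asking about.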

accession annotation genome database • 1.7k views

Hi Tyler, I'm part of a large community of data integrators that has been grappling with identifier issues in the life sciences. We've just submitted a manuscript proposing some standards. Excerpted text is below, but a bit out of context, so please also see the section on CURIEs (Compact URIs) which we believe is a very important up-front consideration.

"When designing new identifiers, be explicit about what it is they identify, but carefully consider how to convey this meaning--whether embedded in the identifier itself, or in the metadata. Meaning is never required to be embedded in an identifier. Meaning may be embedded where 1) durable, 2) coarse-grained, 3) uncontested, and also 4) useful to the data consumer, but only if all four conditions apply without potential edge cases.

Except where durable and deterministic (e.g. an InChI string identifying a chemical structure), you should not embed information that is at the per-entity level (such as name or label). Never embed its type if an entity could change from one type to another, for example if the type depends on the entity’s developmental stage or if the typing nomenclature is not well defined scientifically. If a database name may change, it should not be embedded. These rules of thumb apply especially to [Local Resource Identifiers (LRIs)] but also to the path of the full URIs. Keep in mind that each [Compact URI] prefix must correspond 1:1 with a resolving namespace. If possible, avoid varying URI paths by entity type, authority, etc... as this can be confusing for users."
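
As a rough illustration of the CURIE point (this sketch is not taken from the manuscript), a compact identifier of the form `prefix:local_id` can be expanded to a full URI by looking the prefix up in a prefix map. The prefixes `lncdb` and `genes`, the `example.org` namespaces, and the function names are hypothetical and exist only for this example.

```python
# Hypothetical prefix map: each prefix corresponds to exactly one resolving namespace.
PREFIX_MAP = {
    "lncdb": "https://example.org/lncdb/",
    "genes": "https://example.org/genes/",
}


def check_one_to_one(prefix_map: dict) -> None:
    """Raise if two prefixes point at the same namespace (breaks the 1:1 rule)."""
    namespaces = list(prefix_map.values())
    if len(set(namespaces)) != len(namespaces):
        raise ValueError("two prefixes share one namespace")


def expand_curie(curie: str, prefix_map: dict = PREFIX_MAP) -> str:
    """Expand 'prefix:local_id' into a full URI using the prefix map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefix_map:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefix_map[prefix] + local_id


check_one_to_one(PREFIX_MAP)
print(expand_curie("lncdb:0000123"))
```

Note that the local identifier itself carries no meaning here; the database it belongs to is conveyed entirely by the prefix.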

The submitted manuscript was deposited in Zenodo:

McMurry, Julie et al. (2015). 10 Simple rules for design, provision, and reuse of persistent identifiers for life science data. doi:10.5281/zenodo.18003 <http://dx.doi.org/10.5281/zenodo.18003>

Hi Julie, I liked your paper. I am finishing up my own paper and would like to cite yours. Do you have any estimate regarding when it will be published?


If I were you, I would follow EnsEMBL's design but with fewer digits. Don't forget versioning, which you have not mentioned; it is just as important.

7.6 years ago
John 13k

All the experiences I've ever had with 'Encoded IDs' have been really, really bad. Often an ID just needs to serve one purpose: to be a unique identifier for an entity, typically a row in a database, from which to SELECT or JOIN on.
Encoding extra information into a unique ID rarely does you any favours, and if anything just imposes a bunch of constraints on what counts as an 'acceptable' ID. It is the database equivalent of condensing all your columns into a single text string, and in the process removing all constraints, indexing and the ability to modify data. It leads to all sorts of weird behaviours and headaches down the road for both the implementers and the users.

There are circumstances where an ID could have some structure; it is particularly useful when the first few letters tell you which database the ID comes from, and the last few characters represent a version number. Beyond that, you should really just use plain, incrementing numbers of varying length, which give you uniqueness but also as much flexibility down the road as possible. Consider the following:

- Will the format and all valid substrings of an ID be intuitively known to new users, or could it lead to confusion when, for example, Mus stands for muscle and not mouse?

- Once a Unique ID is generated, will any of that encoded data be liable to change? For example, if you encoded lncRNA family into the ID, but new data shows that a certain lncRNA actually belongs to family B not family A, will you change the ID, breaking all existing references to the lncRNA, or will you maintain both the old (incorrect) ID and the new one?

- Is the database implementation that gets data for a given ID so slow or complicated that it saves much time/complexity storing data in the ID itself?

- Will you store all the data captured in the ID in separate columns anyway?

I obviously don't know your circumstances or requirements, so maybe none of this is applicable, but personally I feel that the less parsing you make your users do and the more your database does, the happier everyone will be :)
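
A minimal sketch of that approach, assuming an SQLite table and a made-up database prefix `LNC`: the accession is just prefix + incrementing number + version, and everything that might change (organism, family) lives in ordinary columns. The table layout, column names and `format_accession` helper are hypothetical, chosen only to illustrate the idea.

```python
import sqlite3

PREFIX = "LNC"  # hypothetical database prefix


def format_accession(number: int, version: int) -> str:
    """Build an accession from prefix, plain incrementing number, and version."""
    return f"{PREFIX}{number}.{version}"


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE lncrna (
           id       INTEGER PRIMARY KEY AUTOINCREMENT,
           version  INTEGER NOT NULL DEFAULT 1,
           organism TEXT NOT NULL,
           family   TEXT
       )"""
)

# The database hands out the number; nothing about the entity is encoded in it.
cur = conn.execute(
    "INSERT INTO lncrna (organism, family) VALUES (?, ?)", ("Mus musculus", "A")
)
row_id = cur.lastrowid
accession = format_accession(row_id, 1)

# New data says the lncRNA belongs to family B: update the column, keep the ID.
conn.execute("UPDATE lncrna SET family = ? WHERE id = ?", ("B", row_id))
family = conn.execute(
    "SELECT family FROM lncrna WHERE id = ?", (row_id,)
).fetchone()[0]

print(accession, family)
```

Because the family is a column rather than part of the accession, reclassifying the lncRNA does not break any existing reference to it.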

7.6 years ago

Have a look at LSID : https://en.wikipedia.org/wiki/LSID http://www.ncbi.nlm.nih.gov/pubmed/15153306

The idea never took off (people want URL-based identifiers), but there were plenty of good ideas in it (resolving an LSID returns an RDF document...)


LSIDs may be good universal identifiers, but they are very bad as accessions. An accession needs to be short enough to display (e.g. in a browser) and informative enough for a human to read and even occasionally remember.

7.6 years ago

I would shy away from including meaning in the ID (see TCGA for examples of well-thought-out accessions that did not stand the test of time).  Instead, focus on each ID having a clear, one-to-one, and immutable correspondence with what it represents.  Inherent in the concept of IDs is the need to develop a versioning scheme as well as a retirement strategy (what to do with IDs that need to be removed or replaced).
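
One possible shape for a retirement strategy, sketched in Python under my own assumptions (the `Registry` class and its method names are hypothetical, not from any existing database): a retired ID is never reused, and it either points to a replacement or resolves to nothing.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Registry:
    active: set[str] = field(default_factory=set)
    # Retired ID -> replacement ID, or None if it was removed without replacement.
    replaced_by: dict[str, str | None] = field(default_factory=dict)

    def register(self, accession: str) -> None:
        """Add a new ID; IDs that were ever issued can never be issued again."""
        if accession in self.active or accession in self.replaced_by:
            raise ValueError(f"{accession} has already been issued")
        self.active.add(accession)

    def retire(self, accession: str, replacement: str | None = None) -> None:
        """Retire an active ID, optionally recording what replaces it."""
        self.active.remove(accession)  # raises KeyError if not active
        self.replaced_by[accession] = replacement

    def resolve(self, accession: str) -> str | None:
        """Follow replacements until an active ID (or None) is reached."""
        while accession not in self.active:
            if accession not in self.replaced_by:
                raise KeyError(f"{accession} was never issued")
            accession = self.replaced_by[accession]
            if accession is None:
                return None
        return accession


registry = Registry()
registry.register("LNC1")
registry.register("LNC2")
registry.retire("LNC1", replacement="LNC2")
print(registry.resolve("LNC1"))
```

Old references keep working because the retired ID still resolves, which is the one-to-one, immutable correspondence described above.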