I have a list of ENA accession numbers pointing to sequences that are sometimes contigs, sometimes scaffolds. I want to group together sequences that come from the same genome, if possible by exploiting the format of the accession number.
The EBI-ENA website gives some rules about the format of accession numbers. In particular:
Assembled/Annotated sequences
[A-Z]{1}\d{5}.\d+
[A-Z]{2}\d{6}.\d+
[A-Z]{4}S?\d{8,9}.\d+
However, they do not explain:
- why there are 3 possibilities
- whether some parts of the accession number are specific to a genome/set of sequences
For example, I have noticed so far that the third rule seems to be used for contigs (example: JQEY01000011), that the first 4 letters seem to be genome-specific, and that the following digits seem to be contig-specific. However, I am not sure whether I can rely on this to group my list of accession numbers. Furthermore, this breaks for scaffold accession numbers, since they do not seem to display any such simple pattern.
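To make the observation concrete, here is a minimal Python sketch of the grouping I have in mind. It only encodes my own guess (the first 4 letters of a third-rule accession identify the genome); it is not a documented ENA rule. The names `PATTERNS`, `classify` and `group_by_prefix` are made up for illustration, the dot before the version is escaped, and the version suffix is treated as optional because my ids do not carry one.

```python
import re
from collections import defaultdict

# The three ENA patterns quoted above. Assumption: the ".N" version suffix
# is made optional here, since the ids in my list have no suffix.
PATTERNS = {
    "rule1": re.compile(r"[A-Z]\d{5}(\.\d+)?"),
    "rule2": re.compile(r"[A-Z]{2}\d{6}(\.\d+)?"),
    "rule3": re.compile(r"[A-Z]{4}S?\d{8,9}(\.\d+)?"),
}


def classify(accession):
    """Return the name of the first rule the accession fully matches, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(accession):
            return name
    return None


def group_by_prefix(accessions):
    """Group third-rule accessions by their 4-letter prefix.

    Accessions matching another rule (or none) are returned separately,
    because no grouping rule is known for them.
    """
    groups = defaultdict(list)
    ungrouped = []
    for acc in accessions:
        if classify(acc) == "rule3":
            groups[acc[:4]].append(acc)
        else:
            ungrouped.append(acc)
    return dict(groups), ungrouped
```

With this sketch, JQEY01000001 and JQEY01000011 end up in the same group, while a scaffold id such as KI912157 matches the second rule and is left ungrouped.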
Are there rules for the formatting of accession numbers in EBI-ENA that can be used reliably to group together assembled/annotated sequences that come from the same genome?
Contacting datasubs at ebi.ac.uk may be a good place to start.
The rules you are referring to specify what the identifier looks like. The link between an identifier and the organism it denotes (e.g. Maribius sp. MOLA 401 corresponds to JQEY01000000; it is not immediately apparent how Maribius and JQEY are related) may not be inferable without a name-key file.
I do not really need the organism name since I don't mind working with ids. What I do need, however, is to be able to identify sequences (contigs or scaffolds) that belong to the same set (WGS project, ...). While it looks straightforward for contig ids (JQEY01000001-JQEY01000033 all belong to the set JQEY01000000), it is less clear when looking at scaffold ids. For instance, the following scaffold ids look pretty similar: KI912157, KI912158, KI912155, KI912156, KI912577, but they do not all come from the same organism. What I would like is a similarity rule to determine whether a set of scaffold ids comes from the same organism.
I will probably try contacting datasubs like you suggest.
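For reference, the contig case can be sketched as follows. This is only an illustration inferred from the JQEY example: it assumes that the two digits after the 4 letters stay fixed within a set and that the set id is obtained by zeroing the remaining digits. `WGS_CONTIG` and `wgs_set_id` are hypothetical names, and the optional `S` and the version suffix from the third rule are ignored here.

```python
import re

# 4 letters, 2 digits assumed to be fixed within a set, then 6-7 serial digits.
WGS_CONTIG = re.compile(r"([A-Z]{4})(\d{2})(\d{6,7})")


def wgs_set_id(accession):
    """Return the presumed set id for a contig id, or None if it does not fit."""
    match = WGS_CONTIG.fullmatch(accession)
    if match is None:
        return None
    prefix, fixed_digits, serial = match.groups()
    # Replace the contig-specific serial with zeros of the same length.
    return prefix + fixed_digits + "0" * len(serial)
```

Here `wgs_set_id("JQEY01000011")` gives `JQEY01000000`, whereas `wgs_set_id("KI912157")` gives `None`, which is exactly the scaffold case I do not know how to resolve.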