For part of my current project I'm trying to create a simple local store of NCBI sequence files indexed by their unique IDs in the database. Of course, NCBI records have several IDs from different systems, and I'm still trying to make sense of how they all work. For this project I'm working with RefSeq entries in the Assembly database, but I'd like to make it work for other sequence databases too, e.g. the Nucleotide database. The record metadata I'm working from comes from the Entrez E-utilities ESummary tool, using UIDs I got from ESearch on the Assembly database.
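For concreteness, here is a minimal sketch of the ESearch-then-ESummary flow described above. It only builds the request URLs and does not fetch anything. The helper names `esearch_url` and `esummary_url` and the search term are made up for illustration; the base URL and the `db`, `term`, `id`, `retmax` and `retmode` parameters are the standard E-utilities ones.

```python
from urllib.parse import urlencode

# Public E-utilities base URL.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that returns UIDs matching `term` in `db`."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def esummary_url(db, uids):
    """Build an ESummary URL for a list of UIDs from the same database."""
    params = {
        "db": db,
        "id": ",".join(str(u) for u in uids),  # comma-separated UID list
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical query; the UIDs below are placeholders, not real records.
    print(esearch_url("assembly", "Escherichia coli[Organism]"))
    print(esummary_url("assembly", [111, 222]))
```

The UIDs returned by the first request are what get passed as `id` to the second, and both requests have to name the same database.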
My current understanding is that each of these values forms its own unique identifier for sequence records, across all NCBI databases:
- Entrez UID (I think this is not unique across databases, so the database name needs to be included as well). This is under "uid" in the ESummary results for the Assembly database, as well as on the web page for a record.
- GenBank ID. Under "gbid" in ESummary results.
- RefSeq ID. Under "rsid" in ESummary results. When both exist, I can't tell whether this is ever different from the GenBank ID.
- GenBank accession, including version number.
- RefSeq accession, including version number.
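If the Entrez UID really is only unique within one database, as I suspect in the first bullet, the local store would need a composite key. A minimal sketch under that assumption follows; `LocalStore` and its methods are hypothetical names, the summary dict is a placeholder shaped like the ESummary fields mentioned in this post ("uid", "assemblyaccession"), and the values in it are invented.

```python
from dataclasses import dataclass, field


@dataclass
class LocalStore:
    """In-memory index of record summaries keyed by (database, UID)."""

    # (db, uid) -> summary dict
    records: dict = field(default_factory=dict)
    # versioned accession -> (db, uid), a secondary lookup
    by_accession: dict = field(default_factory=dict)

    def add(self, db, summary):
        # Assumption: a UID is only unique within its own database,
        # so the database name is part of the key.
        key = (db, str(summary["uid"]))
        self.records[key] = summary
        accession = summary.get("assemblyaccession")
        if accession:
            self.by_accession[accession] = key
        return key

    def get(self, db, uid):
        return self.records.get((db, str(uid)))

    def get_by_accession(self, accession):
        key = self.by_accession.get(accession)
        return None if key is None else self.records[key]


if __name__ == "__main__":
    store = LocalStore()
    # Placeholder values, not a real record.
    store.add("assembly", {"uid": "12345", "assemblyaccession": "GCF_000000000.1"})
    print(store.get("assembly", 12345))
    print(store.get("nucleotide", 12345))  # same UID, other database: None
```

Whether the accession with version is safe to use as a second key on its own is exactly what I'm unsure about.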
It seems like the Entrez UID and GenBank ID are completely separate. As for the RefSeq ID, I don't know whether it always matches the GenBank ID, or what it would mean if they differ.
The Assembly database also has the RefSeq accession in the "assemblyaccession" field, but I can't figure out whether this is different from a standard accession number. The "GCF_" prefix I see on all of these isn't listed on the page for GenBank accession number formats or on the equivalent page for RefSeq. I know the equivalent GenBank prefix is "GCA_", but it doesn't seem to have its own field in assembly records.
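To illustrate the format I'm seeing in "assemblyaccession", here is a small sketch that splits such a value into prefix, number and version. The pattern (prefix, underscore, digits, dot, version) is inferred from the values I've looked at rather than taken from any NCBI specification, `parse_assembly_accession` is a made-up helper, and the accession in the usage lines is a placeholder.

```python
import re

# Assumed shape: GCA_ or GCF_, then digits, then ".", then a version number.
ASSEMBLY_ACCESSION = re.compile(r"^(GCA|GCF)_(\d+)\.(\d+)$")


def parse_assembly_accession(accession):
    """Split an assembly accession into its parts, or return None."""
    match = ASSEMBLY_ACCESSION.match(accession.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    return {
        "prefix": prefix,
        "number": number,          # kept as a string to preserve leading zeros
        "version": int(version),
        # GCF_ is the RefSeq prefix and GCA_ the GenBank one, as noted above.
        "source": "RefSeq" if prefix == "GCF" else "GenBank",
    }


if __name__ == "__main__":
    print(parse_assembly_accession("GCF_000000000.1"))  # placeholder accession
    print(parse_assembly_accession("not-an-accession"))
```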
I also see a lot of mentions of the GI number. I can't tell whether this is the same as the GenBank ID or not.
Can anyone shed any light on any of this? I'm not finding the documentation on the NCBI site too helpful.