How to obtain all accession.version identifiers for a BLAST database using ENTREZ
2.9 years ago

Hi all,

Background: I need to be able to recognize all possible sequence identifiers present in the preformatted NCBI nucleotide databases. I've implemented a regular expression following https://www.ncbi.nlm.nih.gov/Sequin/acc.html, but it is not enough: other accessions (e.g. PDB) are also present. I would therefore like to have examples of every format I can encounter, but I was not able to find any list describing what can actually be inside those databases.

One possible solution, I thought, would be to use ENTREZ to retrieve the accessions for me. There is a blastdbinfo database which lists the available databases, but I am not able to get elink to link anywhere from it.

Let's focus on refseq_genomes as an example. Its record can be found with the following command:

esearch -query refseq_genomes[DB] -db blastdbinfo


So, given that I want the nucleotide sequence accessions present in that database, what should the elink statement be?

esearch -query refseq_genomes[DB] -db blastdbinfo | ... SOME ELINK .... | efetch --format acc


For the ENTREZ experts here: how do I tell which database links to which?

I know I can download the databases and use blastdbcmd to obtain the accessions, but it should be possible to obtain them in some better way.

Thank you

blast entrez elink • 1.1k views

For a given db, you can find all available link names and a brief description as follows:

einfo -db blastdbinfo


The Entrez Link Descriptions webpage also lists this information, but I am not sure how up to date it is. It looks like blastdbcmd may be the best solution for you.


Thank you for the link. According to it, there appears to be no direct link between blastdbinfo and e.g. nuccore.

2.6 years ago

I didn't find any route through ENTREZ. However, I've found that you don't need the whole BLAST database to retrieve the accessions of its sequences.

Only the .nhr, .nin and .nal (.ndb if .nal is not present) files are required to run blastdbcmd -db <database> -entry all -outfmt "%a".

According to this documentation (http://nebc.nerc.ac.uk/bioinformatics/documentation/blast/formatdb.html), the .nhr files hold the headers of the FASTA records and the .nin files are the indices. The .nal file describes the database.
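
A minimal sketch of the idea, assuming the three file types named here really are sufficient: the helper below checks that they are present for a given database prefix and only then calls blastdbcmd. The function names (has_ext, has_min_blast_files, list_accessions) and the example path are hypothetical, and the handling of numbered volume files (e.g. refseq_genomes.00.nhr) is my assumption about how multi-volume databases are laid out.

```shell
# has_ext <db-prefix> <ext>: true if <db>.<ext> or a numbered volume
# file such as <db>.00.<ext> exists (volume naming is an assumption).
has_ext() {
    for f in "$1.$2" "$1".[0-9]*."$2"; do
        [ -f "$f" ] && return 0
    done
    return 1
}

# True if header (.nhr), index (.nin) and alias (.nal) files are present;
# .ndb is accepted in place of .nal, as described in the post.
has_min_blast_files() {
    has_ext "$1" nhr && has_ext "$1" nin &&
        { [ -f "$1.nal" ] || [ -f "$1.ndb" ]; }
}

# Print one accession.version per line, or fail with a message on stderr.
list_accessions() {
    if has_min_blast_files "$1"; then
        blastdbcmd -db "$1" -entry all -outfmt "%a"
    else
        echo "missing .nhr/.nin/.nal (or .ndb) files for $1" >&2
        return 1
    fi
}

# Example (hypothetical path):
#   list_accessions /data/blastdb/refseq_genomes > accessions.txt
```

Checking up front mostly gives a clearer error message than a failing blastdbcmd call would; it does not verify that the files are intact.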

Warning: to my knowledge this is not documented and can stop working at any time. Tested on db v4 and v5.
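
Once the accessions are listed, one possible way to get at the original goal (seeing which identifier formats actually occur) is to collapse each identifier to its "shape" and count the distinct shapes. This is only a sketch: the function name accession_shapes is hypothetical, and the identifiers in the demo are illustrative examples of formats, not output from any database.

```shell
# Reduce each identifier on stdin to its shape (uppercase letter -> A,
# lowercase letter -> a, digit -> 9; everything else is kept) and print
# each distinct shape with its count, most frequent first.
accession_shapes() {
    LC_ALL=C sed -e 's/[A-Z]/A/g' -e 's/[a-z]/a/g' -e 's/[0-9]/9/g' |
        LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn
}

# Demo on a few illustrative identifiers.
printf '%s\n' NC_000001.11 NC_000002.12 AB123456.1 1ABC_A | accession_shapes

# With a real database the input would come from blastdbcmd, e.g.:
#   blastdbcmd -db refseq_genomes -entry all -outfmt "%a" | accession_shapes
```

Each distinct shape is then a candidate pattern that the regular expression has to cover, and a rare shape points to a format that is easy to miss.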


NCBI switched to v5 databases as of Feb 4th, 2020. This still seems to be working fine as far as I can tell.


Yes, I agree. I've also tested it on db v5.