How to obtain all accession.version identifiers for a BLAST database using ENTREZ
4.4 years ago

Hi all,

Background: I need to recognize every kind of sequence identifier present in the preformatted NCBI nucleotide databases. I implemented a regular expression following https://www.ncbi.nlm.nih.gov/Sequin/acc.html, but it is not enough: other accessions (e.g. PDB) are also present. I would like to have examples of all the formats I can encounter, but I could not find any list describing what can actually be inside those databases.
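To make the identifier-recognition problem concrete, here is a minimal sketch of a shape-based classifier in POSIX shell. The helper names (`matches`, `classify_accession`) and the four patterns are assumptions chosen for illustration; they are not an official or exhaustive description of what NCBI databases contain, which is exactly the gap described above.

```shell
# Sketch only: the patterns below are illustrative assumptions, not an
# exhaustive or official description of NCBI identifier formats.

matches() {
    # True if string $1 matches extended regular expression $2.
    printf '%s\n' "$1" | grep -Eq "$2"
}

classify_accession() {
    # Hypothetical helper: prints a rough category for identifier $1.
    if   matches "$1" '^[A-Z]{2}_[A-Z0-9]+\.[0-9]+$';       then echo refseq
    elif matches "$1" '^[0-9][A-Za-z0-9]{3}_[A-Za-z0-9]+$'; then echo pdb
    elif matches "$1" '^[A-Z]{1,2}[0-9]{5,8}\.[0-9]+$';     then echo genbank
    elif matches "$1" '^[A-Z]{4,6}[0-9]{8,}\.[0-9]+$';      then echo wgs
    else echo unknown
    fi
}
```

For example, `classify_accession NC_000001.11` prints `refseq` and `classify_accession 1ABC_A` prints `pdb`. Anything that falls through to `unknown` is a candidate for a format the patterns do not yet cover.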

One possible solution, I thought, would be to use ENTREZ to retrieve the accessions for me. There is a blastdbinfo database, which lists the available databases, but I am not able to get elink to link anywhere from it.

Let's focus on refseq_genomes as an example. The database record can be found with the following command:

esearch -query refseq_genomes[DB] -db blastdbinfo

Given that I want the nucleotide sequence accessions present in that database, what should the elink statement be?

esearch -query refseq_genomes[DB] -db blastdbinfo | ... SOME ELINK .... | efetch --format acc

For ENTREZ experts here - How do I tell which database links where?

I know I can download the databases and use blastdbcmd to obtain the accessions, but it should be possible to obtain them in some better way.

Thank you

blast entrez elink • 1.6k views

For a given db, you can find all available link names and a brief description as follows:

einfo -db blastdbinfo

The Entrez Link Descriptions webpage also lists this information, but I am not sure how up to date it is. It looks like blastdbcmd may be the best solution for you.
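To see which database each link points to, the einfo output can be filtered down to link names and their targets. The sketch below assumes that einfo prints its XML record with one element per line and that each `<Link>` record contains `<Name>` and `<DbTo>` elements; `link_targets` is a hypothetical helper name.

```shell
# Sketch: read einfo XML on stdin and print "link_name<TAB>target_db"
# for every <Link> record. Assumes one XML element per line.
link_targets() {
    awk '
        /<Link>/            { in_link = 1 }
        in_link && /<Name>/ { name = $0; gsub(/.*<Name>|<\/Name>.*/, "", name) }
        in_link && /<DbTo>/ { db = $0;   gsub(/.*<DbTo>|<\/DbTo>.*/, "", db)
                              print name "\t" db }
        /<\/Link>/          { in_link = 0 }
    '
}
```

It would be used as `einfo -db blastdbinfo | link_targets`. If no line in the output names a sequence database as its target, there is no direct elink route from blastdbinfo to the sequences.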


Thank you for the link. According to it, there seems to be no direct link between blastdbinfo and e.g. nuccore.

4.2 years ago

For anybody who wants to retrieve accessions from a BLAST database without downloading the whole database:

I didn't find any route through ENTREZ. However, I found that you don't need the whole BLAST database to retrieve the accessions of its sequences.

Only the .nhr, .nin and .nal (or .ndb if .nal is not present) files are required to run blastdbcmd -entry all -outfmt "%a".

According to this documentation (http://nebc.nerc.ac.uk/bioinformatics/documentation/blast/formatdb.html), the .nhr files hold the headers of the FASTA records and the .nin files are the indices. The .nal file describes the database.

Warning - to my knowledge this is not documented and can stop working at any time. Tested on db v4 and v5.
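The approach above could be wrapped roughly as follows. This is a sketch under assumptions: the function names and the `db_dir`/`db_name` arguments are hypothetical, the volume naming (`name.00.nhr` and so on) is assumed from how multi-volume databases are usually laid out, and blastdbcmd from BLAST+ must be on the PATH for the last function to work.

```shell
# True if at least one file "<dir>/<name>.<ext>" or "<dir>/<name>.*.<ext>" exists.
has_ext() {
    for f in "$1/$2.$3" "$1/$2".*."$3"; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Check for the minimal file set: headers, indices, and an alias or .ndb file.
has_minimal_blastdb_files() {
    has_ext "$1" "$2" nhr || return 1
    has_ext "$1" "$2" nin || return 1
    has_ext "$1" "$2" nal || has_ext "$1" "$2" ndb
}

# Print one accession.version per line for database db_name in db_dir.
list_accessions() {
    db_dir=$1
    db_name=$2
    if has_minimal_blastdb_files "$db_dir" "$db_name"; then
        blastdbcmd -db "$db_dir/$db_name" -entry all -outfmt "%a"
    else
        echo "missing .nhr/.nin/.nal (or .ndb) files for $db_name in $db_dir" >&2
        return 1
    fi
}
```

A call such as `list_accessions /path/to/blastdb refseq_genomes > accessions.txt` would then write the identifiers to a file. The file check only saves a confusing blastdbcmd error; it does not guarantee that the downloaded files are complete.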


NCBI switched to v5 databases as of Feb 4th, 2020. This still seems to be working fine as far as I can tell.


Yes I agree, I've also tested it on db v5.
