Question: How to obtain all accession.version identifiers for a BLAST database using ENTREZ
9 months ago, massa.kassa.sc3na wrote:

Hi all,

Background: I need to be able to recognize all possible sequence identifiers present in the preformatted NCBI nucleotide databases. I've implemented a regular expression following https://www.ncbi.nlm.nih.gov/Sequin/acc.html, but it is not enough, because other accessions (e.g. PDB) are also present. I would therefore like to have examples of every format I can encounter, but I was not able to find any list that describes what can actually be inside those databases.
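To make the problem concrete, here is a minimal sketch of the kind of pattern check described above. The function name classify_id and the three patterns are illustrative assumptions covering only a few well-known identifier shapes (RefSeq, classic GenBank, PDB chain). They are not a complete description of what the databases contain, which is exactly the gap this question is about.

```shell
#!/bin/sh
# Classify a sequence identifier by its shape.
# The patterns are illustrative assumptions, not an exhaustive list.
classify_id() {
    id=$1
    if printf '%s\n' "$id" | grep -Eq '^[A-Z]{2}_[A-Z0-9]+\.[0-9]+$'; then
        echo refseq      # e.g. NM_000546.6, NZ_ABCD01000001.1
    elif printf '%s\n' "$id" | grep -Eq '^[A-Z]{1,2}[0-9]{5,8}\.[0-9]+$'; then
        echo genbank     # e.g. U12345.1, AF123456.1
    elif printf '%s\n' "$id" | grep -Eq '^[0-9][A-Za-z0-9]{3}_[A-Za-z0-9]+$'; then
        echo pdb         # e.g. 1ABC_A (PDB entry plus chain)
    else
        echo unknown     # anything the patterns above do not cover
    fi
}

# Print each example identifier next to its class.
for example in NM_000546.6 NZ_ABCD01000001.1 U12345.1 1ABC_A; do
    printf '%s\t%s\n' "$example" "$(classify_id "$example")"
done
```

Anything reported as unknown is a candidate for a format the expression still misses.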

One possible solution, I thought, would be to use ENTREZ to retrieve the accessions for me. There is a blastdbinfo database which lists the available databases, but I am not able to get elink to link anywhere.

Let's focus on refseq_genomes as an example. The database can be found with the following command:

esearch -query refseq_genomes[DB] -db blastdbinfo

Given that I want the nucleotide sequence accessions present in that database, what should the elink statement be?

esearch -query refseq_genomes[DB] -db blastdbinfo | ... SOME ELINK .... | efetch --format acc

For the ENTREZ experts here: how do I tell which database links to which?

I know I can download the databases and use blastdbcmd to obtain the accessions, but it should be possible to obtain them in some better way.

Thank you


For a given db, you can find all available link names and a brief description as follows:

einfo -db blastdbinfo

The Entrez Link Descriptions webpage also lists this information, but I am not sure how up to date it is. It looks like blastdbcmd may be the best solution for you.

written 9 months ago by vkkodali

Thank you for the link. According to that page, there appears to be no direct link between blastdbinfo and, for example, nuccore.

written 9 months ago by massa.kassa.sc3na
5 months ago, massa.kassa.sc3na wrote:

This is for anybody who might want to retrieve the accessions from a BLAST database without downloading the whole database.

I didn't find any route through ENTREZ. However, I've found that you don't need the whole BLAST database to retrieve the accessions of its sequences.

Only the .nhr, .nin and .nal files (.ndb if .nal is not present) are required to run blastdbcmd -entry all -outfmt "%a".

According to this documentation (http://nebc.nerc.ac.uk/bioinformatics/documentation/blast/formatdb.html), the .nhr files hold the headers of the FASTA sequences and the .nin files hold the indices. The .nal file describes the database.

Warning: to my knowledge this is not documented and could stop working at any time. Tested on db v4 and v5.
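Once the accession list exists, one way to get at the original goal (seeing every identifier format that occurs) is to collapse each identifier to its "shape" and count the shapes. The sketch below is only an illustration: the function name summarize_shapes is hypothetical, and the blastdbcmd call in the comment assumes the database is named refseq_genomes and that its files sit in the current directory.

```shell
#!/bin/sh
# Collapse each identifier to its shape: every capital letter becomes A and
# every digit becomes 9. Then count how often each shape occurs.
# Expected input: one identifier per line, for example the output of
#   blastdbcmd -db refseq_genomes -entry all -outfmt "%a"
# (database name and location are assumptions for this example).
summarize_shapes() {
    LC_ALL=C sed -e 's/[A-Z]/A/g' -e 's/[0-9]/9/g' \
        | LC_ALL=C sort \
        | uniq -c \
        | sort -rn
}

# Small demonstration with made-up identifiers instead of real blastdbcmd output.
printf '%s\n' NC_000001.11 NC_000002.12 NZ_CP012345.1 1ABC_A | summarize_shapes
```

Each output line is a count followed by a shape such as AA_999999.99, so rare or unexpected formats show up at the bottom of the list and can be turned into additional regular expression cases.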


NCBI switched to v5 databases as of Feb 4th, 2020. This still seems to be working fine as far as I can tell.

written 5 months ago by genomax

Yes, I agree. I've also tested it on db v5.

written 5 months ago by massa.kassa.sc3na