Querying nr_clustered database using BLAST REST API?
3 months ago
cmdcolin ★ 3.8k

Hi there, I have managed to query the nr database with the BLAST REST API using code from NCBI's web_blast Perl script, but I have not figured out how to query the nr_clustered database.

Does anyone know what parameters can be used to query nr_clustered via the API?

The web_blast Perl script is here:

https://blast.ncbi.nlm.nih.gov/docs/web_blast.pl

Just swapping nr for nr_clustered in the command-line args does not work.

blast clustered • 423 views
3 months ago
GenoMax 141k

NCBI does not make all of its databases available for remote/web BLAST. It is a couple of years old now, but this is what I found at the time: Blast+ remote database names

Per an NCBI spokesperson who participates here, nr_clustered is going to be made available for download at some point in the future. As I recall from the discussion, the limitation was how to deal with the multiple sequence headers that contribute to a cluster.


Thanks for the cross reference. I am still debugging, but I might have actually gotten the API to return clustered results using the database name nr_cluster_seq. I will post an update if I can confirm.
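For anyone who wants to try the same thing outside Perl, here is a minimal Python sketch of the submission step. It assumes the endpoint and the CMD/PROGRAM/DATABASE/QUERY form fields behave as they do in web_blast.pl, and that the Put response carries "RID = ..." and "RTOE = ..." lines. The function names are made up for illustration, and nr_cluster_seq is simply the database name reported in this thread, not a documented value.

```python
import re
import urllib.parse
import urllib.request

# Assumed endpoint; web_blast.pl posts to a Blast.cgi URL on this host.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_put_params(query, program="blastp", database="nr_cluster_seq"):
    """Return the form fields for a CMD=Put search submission."""
    return {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,  # database name as reported in this thread
        "QUERY": query,
    }


def parse_rid(response_text):
    """Extract the request ID and estimated wait (seconds) from a Put response."""
    rid = re.search(r"^\s*RID = (\S+)", response_text, re.MULTILINE)
    rtoe = re.search(r"^\s*RTOE = (\d+)", response_text, re.MULTILINE)
    if rid is None:
        raise ValueError("no RID found in response")
    return rid.group(1), (int(rtoe.group(1)) if rtoe else None)


def submit(query, **kwargs):
    """POST the search to NCBI. This makes a network request."""
    data = urllib.parse.urlencode(build_put_params(query, **kwargs)).encode()
    with urllib.request.urlopen(BLAST_URL, data=data) as resp:
        return parse_rid(resp.read().decode())
```

Calling submit(sequence) should hand back the RID to poll with CMD=Get, the same way web_blast.pl does. Whether the results come back in clustered form is a separate question.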


I can confirm that the database name you discovered works with command-line remote blast+ (v.2.15). Of course if you have many/large queries ....

blastp -db nr_cluster_seq -query prot.fa -remote -out test_clust

Note: the results look like "normal" nr blast output, though, i.e. a single FASTA header line per hit. The report does include the following:

Database: clustered nr
           307,652,175 sequences; 101,831,877,663 total letters
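A quick way to check which database a saved text report was actually run against is to pull out that "Database:" line. This is a small sketch, assuming the default text output format shown above. The function name is hypothetical.

```python
import re


def database_title(report_text):
    """Return the text after 'Database:' in a BLAST text report, or None."""
    match = re.search(r"^[ \t]*Database:[ \t]*(.+)$", report_text, re.MULTILINE)
    return match.group(1).strip() if match else None
```

For the run above this would be used as database_title(open("test_clust").read()), which makes it easy to compare against a plain nr run.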

Yeah, that is something I was looking into. It seemed like the results didn't really show the 'clustered' type results that the web UI shows, but I will keep looking.


I confirmed that the results of a search against plain nr are different and clearly show that database being used.

Judging from the clustered search on the web (see below), the command-line blast output we get seems to select the top hit as shown in the clustered database (yellow highlight below).

My assumption is that the publicly released blast+ code does not have the necessary bits to show clustered headers/results. This matches the output we currently get from the command-line search against clustered_nr.

[image: clustered]

Clicking the Download button in the web clustered blast lets you download the clustered output. I assume this format will make it into the public blast+ code when NCBI is ready to release clustered_nr as a public download.

Note: Only the sequence of the top "hit" is shown in the alignment though you can see members of the cluster in a separate section.

Query: gi|81097752|gb|AAI09473.1| Bhmt protein [Danio rerio] ID: lcl|Query_8988368(amino acid) Length: 400
Database: nr_clustered(experimental) clustered nr

Clusters producing significant alignments:

Cluster:           NP_001012498.2 betaine--homocysteine S-methyltransferase 1 [Danio rerio]
Num Members:       13
Num Taxa:          11
Scientific Name:   Otophysi
Common Name :      bony fishes
Taxid:             186626
Highest Bit Score: 828
Total Bit Score:   828
Percent Coverage:  100%
Evalue:            0.0
Percent Identity:  99.50%
Accession Length:  400

13 cluster member(s):
Accession        Scientific                     Common                         Taxid      
NP_001012498.2   Danio rerio                    zebrafish                      7955       
AAV74219.1       Danio rerio                    zebrafish                      7955       
KAK2872183.1     Cirrhinus molitorella          mud carp                       172907     
XP_026108603.1   Carassius auratus              goldfish                       7957       
XP_026118749.1   Carassius auratus              goldfish                       7957       
XP_026860087.1   Electrophorus electricus       electric eel                   8005       
XP_055035158.1   Misgurnus anguillicaudatus     oriental weatherfish           75329      
XP_056110631.1   Rhinichthys klamathensis go... bony fishes                    3034132    
XP_056301691.1   Danio aesculapii               bony fishes                    1142201    
XP_056586892.1   Triplophysa dalaica            bony fishes                    1582913    
XP_057184194.1   Triplophysa rosa               bony fishes                    992332     
XP_058613512.1   Onychostoma macrolepis         bony fishes                    369639     
XP_059398410.1   Carassius carassius            crucian carp                   217509     


Alignments:

>betaine--homocysteine S-methyltransferase 1 [Danio rerio]
Sequence ID: NP_001012498.2 Length: 400
Range 1: 1 to 400

Score:828 bits(2140), Expect:0.0, 
Method:Compositional matrix adjust., 
Identities:398/400(99%), Positives:398/400(99%), Gaps:0/400(0%)

Query  1    MAPVGSKRGVLERLNAGEVVIGDGGFVFALEKRGYVKAGPWTPEAAAEHPEAVRQLHREF  60
            MAPVGSKRGVLERLNAGEVVIGDGGFVFALEKRGYVKAGPWTPEAAAEHPEAVRQLHREF
Sbjct  1    MAPVGSKRGVLERLNAGEVVIGDGGFVFALEKRGYVKAGPWTPEAAAEHPEAVRQLHREF  60
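If you want the cluster members out of a downloaded report like the one above, a minimal sketch is to slice the fixed-width member table using the column positions of its header row. This assumes the download keeps the padding shown above. Splitting on runs of spaces would not work, because a truncated name such as "Rhinichthys klamathensis go..." fills its column and is followed by a single space. The function name is hypothetical.

```python
def parse_cluster_members(lines):
    """Parse the fixed-width member table; lines[0] must be the header row."""
    header = lines[0]
    names = ["Accession", "Scientific", "Common", "Taxid"]
    # Column start offsets are taken from where each header word begins.
    starts = [header.index(name) for name in names]
    ends = starts[1:] + [None]
    members = []
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the table
        fields = [line[s:e].strip() for s, e in zip(starts, ends)]
        members.append(dict(zip(names, fields)))
    return members
```

Pass it the lines starting at the "Accession  Scientific  Common  Taxid" header. It returns one dict per member, with the taxid kept as a string.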