blastx diamond returns taxid "NA" for some sequence queries
8 months ago

Hello, I ran blastx against the nr database using DIAMOND (on an HPC cluster). When analysing the results, I noticed that some hits were returned without taxids, although other information such as pident or scientific names was returned. A sample of the results is shown below.

columns are: qseqid evalue bitscore length pident stitle qcovhsp sscinames staxids

1 TRINITY… 6.20e- 92 283.    202  71.8 XP_031787202.1…  70.9 Nasonia vi… 7425
2 TRINITY… 1.20e-108 330.    164  99.4 NP_729590.1 un…  90.9 Drosophila… 7227;…
3 TRINITY… 3.70e- 25 108.     50 100   XP_020808035.1…  55.1 Drosophila… 7274
4 TRINITY… 3.20e- 48 168.    100  71   XP_014215226.1… 100   Copidosoma… 29053
5 TRINITY… 5.10e-121 352.    172 100   HAH0498887.1 w… 100   N/A         NA
6 TRINITY… 1.30e- 11  67.4    46  71.7 EEY2123519.1 s…  54.3 N/A         NA
7 TRINITY… 2.10e-104 314.    158 100   ABC86463.1 IP0…  86   Drosophila… 7227
8 TRINITY… 5.20e- 19  86.3    45 100   WP_021218988.1…  55.1 Pseudomona… 43263…

Rows 5 and 6 show "NA" for staxids. However, when I search NCBI for the sequence name (from "stitle"), I can find a taxid for both of these sequences (in this case E. coli, taxid 562). I downloaded the NCBI taxdmp and built the DIAMOND database to include taxonomy.
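
To list every affected hit rather than spotting them by eye, something like the sketch below might help. It assumes the DIAMOND output is a tab-separated file with the nine columns in the order given above, that a missing taxid shows up as an empty field, "NA" or "N/A" (this depends on how the table was written and read), and that the accession is the first whitespace-delimited token of stitle. The function name and the demo rows are made up for illustration.

```python
import csv
import io

# Column order as listed in the post; the tab-separated layout is an assumption.
COLUMNS = ["qseqid", "evalue", "bitscore", "length", "pident",
           "stitle", "qcovhsp", "sscinames", "staxids"]

# Values treated as "no taxid". Which one appears depends on the tool that
# wrote or re-read the table, so all three are accepted here.
MISSING = {"", "NA", "N/A"}


def hits_without_taxid(handle):
    """Yield (qseqid, accession) for each hit whose staxids field is missing."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Pad short rows so a trailing empty staxids column is still handled.
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        row = dict(zip(COLUMNS, fields))
        if row["staxids"].strip() in MISSING:
            # Assumption: the accession is the first token of stitle.
            tokens = row["stitle"].split()
            yield row["qseqid"], tokens[0] if tokens else ""


if __name__ == "__main__":
    # Two made-up rows standing in for a real output file.
    demo = io.StringIO(
        "q1\t6.2e-92\t283\t202\t71.8\tACC_A.1 example protein\t70.9\tSome species\t7425\n"
        "q2\t5.1e-121\t352\t172\t100\tACC_B.1 example protein\t100\tN/A\t\n"
    )
    for qseqid, accession in hits_without_taxid(demo):
        print(qseqid, accession)
```

For a real run, the same function can be given an open file handle instead of the in-memory demo, and the resulting accessions can then be checked against NCBI.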

Have you ever had this problem, i.e. some sequences failing to receive a taxid during a blastx search? Thank you in advance for any help!

It looks like those hits are to accessions in the Identical Protein Groups database. My guess is that those are not represented in the NCBI taxonomy.
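
One way to test this guess would be to check whether the affected accessions appear in the accession-to-taxid mapping used when building the database. The sketch below assumes that mapping was NCBI's prot.accession2taxid file (the post does not say which file was used), which is tab-separated with the columns accession, accession.version, taxid and gi. The function names are hypothetical.

```python
import gzip


def lookup_taxids(lines, wanted):
    """Return {accession.version: taxid} for the accessions in `wanted`.

    `lines` is any iterable of text lines in the prot.accession2taxid layout:
    accession, accession.version, taxid, gi (tab-separated, with a header).
    """
    wanted = set(wanted)
    found = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] == "accession":
            continue  # skip the header line and malformed rows
        if fields[1] in wanted:
            found[fields[1]] = fields[2]
            if len(found) == len(wanted):
                break  # everything located, no need to read further
    return found


def missing_from_mapping(path, wanted):
    """Return the accessions in `wanted` that never appear in the gzipped mapping file."""
    with gzip.open(path, "rt") as handle:
        found = lookup_taxids(handle, wanted)
    return sorted(set(wanted) - set(found))
```

If accessions such as HAH0498887.1 come back from `missing_from_mapping`, that would be consistent with the mapping simply not covering them, in which case DIAMOND would have no taxid to report for those hits.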


