Hello, I ran blastx against the nr database using DIAMOND (on an HPC cluster). After analysing the results, I noticed that some hits returned no taxids, although other information such as pident or scientific names was returned. A sample of the results is below.
Columns are: qseqid evalue bitscore length pident stitle qcovhsp sscinames staxids
1 TRINITY… 6.20e- 92 283. 202 71.8 XP_031787202.1… 70.9 Nasonia vi… 7425
2 TRINITY… 1.20e-108 330. 164 99.4 NP_729590.1 un… 90.9 Drosophila… 7227;…
3 TRINITY… 3.70e- 25 108. 50 100 XP_020808035.1… 55.1 Drosophila… 7274
4 TRINITY… 3.20e- 48 168. 100 71 XP_014215226.1… 100 Copidosoma… 29053
5 TRINITY… 5.10e-121 352. 172 100 HAH0498887.1 w… 100 N/A NA
6 TRINITY… 1.30e- 11 67.4 46 71.7 EEY2123519.1 s… 54.3 N/A NA
7 TRINITY… 2.10e-104 314. 158 100 ABC86463.1 IP0… 86 Drosophila… 7227
8 TRINITY… 5.20e- 19 86.3 45 100 WP_021218988.1… 55.1 Pseudomona… 43263…
Rows 5 and 6 show "NA" for staxids. But when I search NCBI for the accession in "stitle", I can find a taxid for both of these sequences (in this case E. coli, taxid 562). I downloaded the NCBI taxdmp and built the DIAMOND database to include taxonomy.
Have you ever had this problem, i.e. some sequences failing to receive a taxid during a blastx search? Thank you in advance for any help!
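In case it helps anyone reproduce this, here is a rough Python sketch for pulling out the subject accessions of hits that have no taxid. It assumes the DIAMOND output is tab-separated with the nine columns listed above, that the accession is the first whitespace-delimited token of stitle, and that a missing taxid shows up as an empty field, "NA" or "N/A". The function name and the missing-value markers are my own choices, not anything DIAMOND defines.

```python
import csv

# Column order as listed in the post; adjust to match your own output format.
COLUMNS = ["qseqid", "evalue", "bitscore", "length", "pident",
           "stitle", "qcovhsp", "sscinames", "staxids"]

# Assumption: these strings mean "no taxid was assigned".
MISSING = {"", "NA", "N/A"}


def missing_taxid_accessions(lines, columns=COLUMNS):
    """Return the subject accessions of hits whose staxids field is missing."""
    title_idx = columns.index("stitle")
    taxid_idx = columns.index("staxids")
    accessions = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < len(columns):
            continue  # skip blank or malformed lines
        if row[taxid_idx].strip() in MISSING:
            # Assumption: stitle starts with the accession, e.g. "HAH0498887.1 ..."
            tokens = row[title_idx].split()
            if tokens:
                accessions.append(tokens[0])
    return accessions
```

Typical use would be `with open("results.tsv") as fh: print(missing_taxid_accessions(fh))`, where `results.tsv` is a placeholder for your own output file.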
Looks like those hits are to accessions in the Identical Protein Groups database. My guess is those are not represented in the NCBI taxonomy.
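One way to test that guess is to check whether the accessions appear at all in the accession-to-taxid mapping the database was built from. The sketch below assumes that mapping is NCBI's prot.accession2taxid file (optionally gzipped), laid out as a header line followed by four tab-separated columns: accession, accession.version, taxid, gi. The function name is hypothetical. If an accession comes back as None, the mapping itself has no entry for it, which would explain the missing taxid.

```python
import gzip


def lookup_taxids(mapping_path, accessions):
    """Map each accession.version to its taxid string, or None if absent."""
    wanted = set(accessions)
    found = {}
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if str(mapping_path).endswith(".gz") else open
    with opener(mapping_path, "rt") as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # fields[1] is accession.version, fields[2] is the taxid
            if len(fields) >= 3 and fields[1] in wanted:
                found[fields[1]] = fields[2]
                if len(found) == len(wanted):
                    break  # everything located, stop scanning early
    return {acc: found.get(acc) for acc in wanted}
```

For example, `lookup_taxids("prot.accession2taxid.gz", ["HAH0498887.1", "EEY2123519.1"])`. The file is large, so a single streaming pass like this is slow but avoids loading it into memory.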