Long story short, I dun goof'd: I ran DIAMOND on a huge number of metagenomics samples, but I didn't include the "staxids" parameter in the output. (In my puny defense, this parameter wasn't mentioned in the DIAMOND manual.) Now I have a couple of hundred thousand output lines that look like this:
042SRF022_1 gi|751637161|ref|WP_041104882.1| 40.4 151 82 2 999 547 1 143 2.8e-21 110.9
Is there a sane way of getting the taxonomy for each output line, so I can create a report without having to run the entire thing again? I tried getting the "gi_taxid_nucl.dmp.gz" file from NCBI and running grep for each gi, but it a) takes ages and b) doesn't seem to work. I am thinking of loading the entire file into an SQL database and running queries against it. Any ideas welcome.
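One option that avoids both a grep per gi and a full database, sketched in Python: collect the set of GIs that actually occur in the DIAMOND output, stream the NCBI mapping file once, and keep only the rows whose GI is in that set. This is a minimal sketch under a few assumptions: the mapping file is two whitespace-separated columns (gi, then taxid), the subject IDs look like gi|...|ref|...|, and the function names and the taxid in the demo are made up for illustration. Also note that the subject ID in the example line is a protein accession (WP_...), so the protein mapping (gi_taxid_prot.dmp.gz) is probably the file to use rather than the nucleotide one, which might explain why grep finds nothing.

```python
import gzip
from typing import Dict, Iterable, Iterator, Set


def extract_gi(subject_id: str) -> str:
    """Return the GI from an ID such as 'gi|751637161|ref|WP_041104882.1|'."""
    parts = subject_id.split("|")
    if len(parts) < 2 or parts[0] != "gi":
        raise ValueError(f"no GI found in subject id: {subject_id!r}")
    return parts[1]


def collect_gis(diamond_lines: Iterable[str]) -> Set[str]:
    """Gather every distinct GI from column 2 of the DIAMOND output."""
    gis = set()
    for line in diamond_lines:
        fields = line.split()
        if len(fields) >= 2:
            gis.add(extract_gi(fields[1]))
    return gis


def load_taxids(dump_lines: Iterable[str], wanted: Set[str]) -> Dict[str, str]:
    """Stream the gi->taxid dump once, keeping only the GIs we need."""
    mapping = {}
    for line in dump_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] in wanted:
            mapping[fields[0]] = fields[1]
    return mapping


def annotate(diamond_lines: Iterable[str], mapping: Dict[str, str]) -> Iterator[str]:
    """Yield each DIAMOND line with the taxid appended ('NA' if unknown)."""
    for line in diamond_lines:
        line = line.rstrip("\n")
        fields = line.split()
        if len(fields) < 2:
            continue
        gi = extract_gi(fields[1])
        yield f"{line}\t{mapping.get(gi, 'NA')}"


if __name__ == "__main__":
    # In-memory demo; the taxid 12345 is a placeholder, not the real value.
    hits = ["042SRF022_1 gi|751637161|ref|WP_041104882.1| 40.4 151 82 2 "
            "999 547 1 143 2.8e-21 110.9"]
    dump = ["751637160\t111\n", "751637161\t12345\n"]
    taxids = load_taxids(dump, collect_gis(hits))
    for row in annotate(hits, taxids):
        print(row)
    # For real files, pass open handles instead of lists, e.g.
    # gzip.open("gi_taxid_prot.dmp.gz", "rt") for the dump.
```

Since only the GIs present in the hits are kept, memory use scales with the number of distinct subjects rather than with the size of the dump, and the dump is read exactly once. If the lookups need to be repeated for many runs, loading the dump into SQLite with an index on the gi column would be the database version of the same idea.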