Question

Getting full taxonomy from BLAST results without "staxids" in output

8.4 years ago

ScubaChris ▴ 10

Hi everyone,

long story short, I dun goof'd: I ran DIAMOND on a huge number of metagenomics samples, but I didn't include the "staxids" parameter in the output. (In my puny defense, this parameter wasn't mentioned in the DIAMOND manual.) Now I have a couple of hundred thousand output lines looking like this:

042SRF022_1 gi|751637161|ref|WP_041104882.1|    40.4    151 82  2   999 547 1   143 2.8e-21 110.9

Is there a sane way of getting the taxonomy for each output line so I can create a report without having to run the entire thing again? I tried getting the "gi_taxid_nucl.dmp.gz" from NCBI and running grep on each gi, but it a) takes ages and b) doesn't seem to work. I am thinking of putting the entire file in an SQL database and running queries on it. Any ideas welcome.

taxonomy blast diamond metagenomics • 2.6k views

updated 8.4 years ago by Pierre Lindenbaum 166k • written 8.4 years ago by ScubaChris ▴ 10

score 3 · Accepted Answer · 2017-02-24

I tried getting the "gi_taxid_nucl.dmp.gz" from NCBI and running grep on each gi, but it a) takes ages

Because it's the wrong method. Instead, extract the gi from the BLAST output and sort on that column:

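# diamond_output.tsv and hits.sorted.tsv are placeholder file names
awk -F '|' '{printf("%s\t%s\n",$2,$0);}' diamond_output.tsv | sort -t $'\t' -k1,1 > hits.sorted.tsv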

decompress gi_taxid_nucl.dmp.gz and sort it on the gi column (the first column)

and then use linux join to merge both files.
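The three steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the file names, the second hit line, and all taxids are made up, and only the first hit line comes from the question. Since DIAMOND aligns against protein databases and WP_ accessions are RefSeq proteins, the protein mapping (gi_taxid_prot.dmp.gz) may be the file that actually contains these gi numbers; that is worth checking before sorting the real file.

```shell
set -eu
export LC_ALL=C            # sort and join must agree on the collation order
TAB=$(printf '\t')         # portable stand-in for $'\t'

workdir=$(mktemp -d)
cd "$workdir"

# Toy DIAMOND output: line 1 is from the question, line 2 is invented.
printf '042SRF022_1\tgi|751637161|ref|WP_041104882.1|\t40.4\t151\t82\t2\t999\t547\t1\t143\t2.8e-21\t110.9\n' > hits.tsv
printf '042SRF022_2\tgi|22|ref|XX_000000.1|\t50.0\t100\t50\t0\t1\t300\t1\t100\t1e-10\t60.0\n' >> hits.tsv

# Toy gi -> taxid table (placeholder taxids, not real NCBI values).
printf '2\t111\n22\t222\n751637161\t333\n' > gi_taxid.dmp

# 1. Prefix every hit with its gi and sort on that key.
awk -F '|' '{printf("%s\t%s\n",$2,$0);}' hits.tsv | sort -t "$TAB" -k1,1 > hits.sorted.tsv

# 2. Sort the mapping on the gi column.
sort -t "$TAB" -k1,1 gi_taxid.dmp > gi_taxid.sorted.dmp

# 3. Join on the gi: output is gi, taxid, then the original hit columns.
join -t "$TAB" -1 1 -2 1 gi_taxid.sorted.dmp hits.sorted.tsv > hits.with_taxid.tsv

# Show gi, taxid and query id for each annotated hit.
cut -f 1-3 hits.with_taxid.tsv
```

With the real data, the mapping would first be decompressed (for example with `gzip -dc`) and piped into the same `sort` call. Sorting the large file is slow, but it is done once, and `join` then reads both files in a single pass instead of rescanning the mapping for every gi.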

and b) doesn't seem to work.

because grepping for just "2" also matches "22", "222", "gene2", etc.
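To make the difference concrete, here is a small sketch comparing an unanchored grep with an exact match on the gi column. The mapping file and its taxids are invented for illustration, and the two-column, tab-separated layout (gi, then taxid) is assumed.

```shell
set -eu
TAB=$(printf '\t')
workdir=$(mktemp -d)

# Toy gi -> taxid table (placeholder values).
printf '2\t111\n22\t222\n222\t333\n' > "$workdir/gi_taxid.dmp"

# Unanchored pattern: counts every line that contains a "2" anywhere.
loose=$(grep -c '2' "$workdir/gi_taxid.dmp")

# Anchored pattern: the gi must start the line and be followed by a tab.
exact=$(grep "^2${TAB}" "$workdir/gi_taxid.dmp")

# Same idea with awk: compare the first field as a whole string.
exact_awk=$(awk -F '\t' '$1 == "2"' "$workdir/gi_taxid.dmp")

echo "unanchored matches: $loose"
echo "anchored match:     $exact"
```

Even with exact matching, one grep per gi rescans the whole mapping each time, which is why the sort-and-join route scales better for hundreds of thousands of hits.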