What's the most straightforward (and fastest) way of retrieving GenBank descriptions based on a list of GI numbers?
I've got the blast tabular output for 54k hits, each with a GI number:
2-2129 gi|514995435|gb|KC795686.1| 100.000 136 0 0 115 250 2358 2493 6.78e-62 246
2-2129 gi|514995431|gb|KC795685.1| 100.000 136 0 0 115 250 2358 2493 6.78e-62 246
2-2129 gi|514883433|gb|JX127248.1| 100.000 136 0 0 115 250 97570 97435 6.78e-62 246
2-2129 gi|500229602|gb|KC619528.1| 100.000 136 0 0 115 250 14268 14403 6.78e-62 246
2-2129 gi|493665175|dbj|AP012055.1| 100.000 136 0 0 115 250 64464 64329 6.78e-62 246
2-2129 gi|403311657|gb|JX262235.1| 100.000 136 0 0 115 250 3322 3187 6.78e-62 246
2-2129 gi|403311642|gb|JX262232.1| 100.000 136 0 0 115 250 3343 3208 6.78e-62 246
2-2129 gi|399573441|gb|JX182975.1| 100.000 136 0 0 115 250 33 168 6.78e-62 246
2-2129 gi|394343076|gb|CP003683.1| 100.000 136 0 0 115 250 2064023 2064158 6.78e-62 246
2-2129 gi|384875611|gb|JQ394799.1| 100.000 136 0 0 115 250 2921 2786 6.78e-62 246
But I'm trying to identify what they are at a glance. Each line corresponds to a unique BLAST hit, as I've already done some filtering: the hits were collected from collapsed FASTQs (600,000 unique collapsed reads, giving 191 million BLAST hits at 95% identity, subsequently collapsed to 54k unique GI hits).
What I'd ideally like to do is get the GenBank descriptions that correspond to each, and write them to the corresponding line in the BLAST output (or at least write out a new file with the ANI, query ID and so on).
From googling, the thread "Fetching Genbank Entries For List Of Accession Numbers" seems to be on a similar track, but I'm not sure what the syntax would be to retrieve the descriptions.
NCBI has phased out GI numbers as of this month. You should use the accession numbers for retrieving the descriptions.
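For IDs shaped like the ones in the question (gi|514995435|gb|KC795686.1|), the accession is the fourth pipe-separated field of column 2. A minimal sketch of pulling out the unique accessions, assuming whitespace-separated BLAST columns; the function name and the file names in the usage line are hypothetical:

```shell
# Hypothetical helper: print the unique accession.version IDs found in
# column 2 of a BLAST tabular file (IDs shaped like gi|514995435|gb|KC795686.1|).
extract_accessions() {
    # $1: BLAST tabular file; whitespace-separated columns are assumed
    awk '{ split($2, id, "|"); if (id[4] != "") print id[4] }' "$1" | sort -u
}
```

Something like `extract_accessions blast_hits.txt > accessions.txt` would then give one accession per line, already sorted and de-duplicated.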
You can put the accession numbers in a file (one entry per line; sort/unique them to save time) and then use the blastdbcmd tool from the BLAST+ package to retrieve the descriptions (you would need the BLAST indexes from NCBI for this to work).
Yeah, I had heard that; I was hoping I might not be too late.
Nevertheless, I think I have a semi-working solution now. I ran the BLAST search against a local (and old) nr database, which I suspect is why it gave me GIs by default. At any rate, I think I'd have needed too many queries to be manageable on the Entrez API with its restrictions, but as it was a local DB, the following works:
blastdbcmd -db /blastdb/blast/nt -entry_batch uni1_uniquesortedGIs.txt -outfmt '%g %t' -target_only -out unit1_GI_matches.txt
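To get the titles back onto the corresponding BLAST lines, as asked in the original question, one option is a small awk join keyed on the GI. This is only a sketch: it assumes the blastdbcmd output has the '%g %t' shape used above (GI, a space, then the title), that the titles file is not empty, and that column 2 of the BLAST file looks like gi|514995435|gb|KC795686.1|. The function name and the BLAST file name in the usage line are hypothetical.

```shell
# Hypothetical helper: append each hit's title to its BLAST line.
# $1: blastdbcmd output in '%g %t' form (GI, a space, then the title)
# $2: BLAST tabular file with IDs like gi|514995435|gb|KC795686.1| in column 2
annotate_hits() {
    awk '
        NR == FNR {                  # first file: build the GI -> title map
            gi = $1
            sub(/^[^ ]+ +/, "")      # drop the GI, keep the rest as the title
            title[gi] = $0
            next
        }
        {                            # second file: the BLAST hits
            split($2, id, "|")       # id[2] is the GI number
            desc = (id[2] in title) ? title[id[2]] : "NA"
            print $0 "\t" desc       # original line, a tab, then the title
        }
    ' "$1" "$2"
}
```

Usage would be along the lines of `annotate_hits unit1_GI_matches.txt blast_hits.txt > blast_hits_annotated.txt`. Hits whose GI is missing from the titles file get "NA" rather than being dropped, so the output keeps one line per input hit.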
Seems we both posted the solution at the same time!