Question

Get species/genus names from NCBI nr protein accession IDs for phylogenetic tree annotation?

3.3 years ago

izhang • 0

I have a list of protein accession IDs from the NCBI nr database that look like this:

WP_0445013 WP_1884344 TBR13838

These are all bacterial proteins from a range of different bacteria, and I've built a phylogenetic tree from them. However, the tip labels on the tree are these accession IDs, and I want to label the tips with taxonomy instead. I'm not very familiar with the Entrez system, but is there an easy way to replace these accession IDs with the taxonomy of each sequence, such as genus and species names?

Any help is appreciated, thanks!

NCBI sequence • 1.0k views

ADD COMMENT • link 3.3 years ago by izhang • 0

You could do something like the following using Entrez Direct (EDirect):

$ esearch -db protein -query "WP_000445013.1" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism
Escherichia coli

Note that the example accession numbers you posted don't seem to be valid. Also keep in mind that WP_ accessions are non-redundant RefSeq protein records, so a single WP_ accession can belong to multiple organisms.
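
For a whole list of accessions, the same lookup can be wrapped in a loop and the result used to relabel the tree. The following is only a rough sketch under some assumptions: the function names `fetch_organisms` and `relabel_tree` and the file names are made up for illustration, the tree is assumed to be in Newick format with unquoted labels, and only the first organism reported for each accession is kept (which matters for WP_ records). The accession is appended to the organism name so that tip labels stay unique.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_organisms and relabel_tree are hypothetical helper names.

# Print "accession<TAB>organism" for every accession in a file (one per line).
# Needs EDirect (esearch, efetch, xtract) on PATH and network access.
fetch_organisms() {
    local acc_file="$1" acc organism
    while read -r acc; do
        [ -z "$acc" ] && continue
        # </dev/null keeps esearch from consuming the loop's input
        organism=$(esearch -db protein -query "$acc" </dev/null |
            efetch -format docsum |
            xtract -pattern DocumentSummary -element Organism |
            head -n 1)
        printf '%s\t%s\n' "$acc" "${organism:-NA}"
    done < "$acc_file"
}

# Rewrite a Newick tree so each known accession becomes "Genus_species_accession".
# Labels are matched as whole tokens between the Newick delimiters ( ) , : ;
# so one accession cannot accidentally match inside a longer one.
# Assumes the mapping file is not empty.
relabel_tree() {
    local map_file="$1" tree_file="$2"
    awk -F'\t' '
        NR == FNR {
            # first file: the mapping; make the organism name Newick-safe
            gsub(/[^A-Za-z0-9._-]/, "_", $2)
            map[$1] = $2 "_" $1
            next
        }
        {
            # second file: the tree; copy it through, swapping known labels
            out = ""; rest = $0
            while (match(rest, /[^(),:;]+/)) {
                tok = substr(rest, RSTART, RLENGTH)
                out = out substr(rest, 1, RSTART - 1) ((tok in map) ? map[tok] : tok)
                rest = substr(rest, RSTART + RLENGTH)
            }
            print out rest
        }
    ' "$map_file" "$tree_file"
}
```

After sourcing the functions, usage might look like `fetch_organisms accessions.txt > acc2org.tsv` followed by `relabel_tree acc2org.tsv tree.nwk > tree.labeled.nwk`. Labels in the tree have to match the accessions in the table exactly, including the version suffix, otherwise they are left unchanged.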

ADD REPLY • link 3.3 years ago by GenoMax 141k

Thank you, that works! It seems my Phylip conversion program truncated some of the accession numbers. I retrieved these proteins from NCBI nr, but is there a place where I can download the entire set of complete, annotated bacterial genomes? I'm trying to look at the evolution of a widespread metabolic pathway across as many bacteria as possible.

ADD REPLY • link 3.3 years ago by izhang • 0