Question

Using Entrez to find the taxonomy for an accession number

0


5.4 years ago

Frieda ▴ 60

Hello,

I am using Entrez to find the taxonomy for an accession number. This is how I am looking for it:

esearch -db protein -query "WP_003131952.1" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

As shown above, I first have to specify the database in order to look up the taxonomy for an accession number. My question is: suppose we have an accession number but do not know which database it belongs to. Is there a way to search all the NCBI databases at once?

accession_number taxonomy NCBI entrez • 2.1k views

updated 5.4 years ago by vkkodali_ncbi ★ 3.8k • written 5.4 years ago by Frieda ▴ 60

1


Would it work if you list the candidate databases in a file (db.list), one database per line, and use something like this?

while read -r db ; do esearch -db "${db}" -query "WP_003131952.1" < /dev/null | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}' ; done < db.list

This command is not efficient because it searches every database in the list. You could instead write a script with if/else conditions that checks each database in turn and only continues while nothing has been found (for example, loop as long as found==0).

5.4 years ago by Fatima ▴ 1000
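The stop-at-first-hit idea from the comment above could be sketched roughly as follows. This is only an illustration, not a tested recipe: `lookup_taxonomy` and `first_hit` are hypothetical helper names, the `LOOKUP` variable is an assumption added so the lookup step can be swapped out, and the pipeline reuses the esearch/elink/efetch/xtract steps that appear elsewhere in this thread.

```shell
# Query one database; prints the comma-separated lineage, or nothing if there is no hit.
lookup_taxonomy() {
    # </dev/null is a precaution so esearch does not read the caller's stdin.
    esearch -db "$1" -query "$2" </dev/null \
      | elink -target taxonomy \
      | efetch -format native -mode xml \
      | xtract -pattern TaxaSet -sep ',' -element ScientificName
}

# Usage: first_hit ACCESSION DB [DB...]
# Tries each database in order and stops at the first one that returns a lineage.
# Prints "accession<TAB>database<TAB>lineage"; returns 1 if no database had a hit.
first_hit() {
    acc=$1
    shift
    for db in "$@"; do
        # A failed lookup is treated as "no hit" rather than a fatal error.
        result=$("${LOOKUP:-lookup_taxonomy}" "$db" "$acc") || true
        if [ -n "$result" ]; then
            printf '%s\t%s\t%s\n' "$acc" "$db" "$result"
            return 0
        fi
    done
    return 1
}
```

With a db.list file as described above, a call might look like `first_hit WP_003131952.1 $(cat db.list)`. Each database still costs one round trip to NCBI until a hit is found, so ordering the list by the most likely database first should help.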

score 0 · Answer 1 · 2020-02-13

Depending on what kind of accessions you have (GenBank, RefSeq or Swiss-Prot), you should be able to tell which database the accession is coming from. Take a look at this page. For example, all RefSeq proteins will have accessions in the format [NAXWY]P_\d+\.\d+. Once you know that, you can filter your input list and use the same EntrezDirect command.
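As a rough sketch of that filtering step, the RefSeq protein pattern can be applied with grep. The function name `filter_refseq_proteins` and the input file name in the usage note are hypothetical, and the version suffix is treated as optional here, which is an assumption (the pattern quoted above requires it).

```shell
# Reads accessions on stdin (one per line) and prints only those that look like
# RefSeq proteins: prefix AP_, NP_, XP_, WP_ or YP_, then digits, then an
# optional ".version" suffix (making the suffix optional is an assumption).
filter_refseq_proteins() {
    # grep exits with status 1 when nothing matches; "|| true" keeps that from
    # being treated as an error by the caller.
    grep -E '^[NAXWY]P_[0-9]+(\.[0-9]+)?$' || true
}
```

For example, `filter_refseq_proteins < all_accs.txt > accs_list.txt` would write the matching accessions to the accs_list.txt file used below. Accessions from other sources (GenBank, Swiss-Prot) would need their own patterns.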

Once you have a filtered list, you can use EntrezDirect as follows:

# Print each accession, a tab, then its comma-separated lineage
for acc in $(cat accs_list.txt) ; do
    printf '%s\t' "${acc}"
    elink -db protein -target taxonomy -id "${acc}" \
      | efetch -format native -mode xml \
      | xtract -pattern TaxaSet -sep ',' -element ScientificName
done

## example output
NP_002817     Homo sapiens,cellular organisms,Eukaryota,Opisthokonta,Metazoa,Eumetazoa,...
WP_003131952  Lactococcus,cellular organisms,Bacteria,Terrabacteria group,Firmicutes,...

This will generate a two-column, tab-delimited table with the accession and its lineage. However, if you don't particularly care about the accession-to-lineage map and just want a list of lineages for your entire set of accessions, you can use epost as follows:

epost -db protein -format acc -input accs_list.txt \
  | elink -db protein -target taxonomy \
  | efetch -format native -mode xml \
  | xtract -pattern TaxaSet -sep ',' -element ScientificName

## example output
Homo sapiens,cellular organisms,Eukaryota,Opisthokonta,Metazoa,Eumetazoa,...
Lactococcus,cellular organisms,Bacteria,Terrabacteria group,Firmicutes,...