Using Entrez to find the taxonomy for an accession number
1
0
Entering edit mode
4.2 years ago
Frieda ▴ 60

Hello,

I am using Entrez to find the taxonomy for an accession number. This is how I am looking for it:

esearch -db protein -query "WP_003131952.1" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

As seen above, to search for the taxonomy of an accession number, I need to first specify the database. My question is that, imagine there is an accession number, and we do not know the database for this accession number. Is there a way to search all the databases on NCBI?

accession_number taxonomy NCBI entrez • 1.6k views
ADD COMMENT
1
Entering edit mode

Would it work if you add the possible dbs to a file (db.list) each db in one line and use something like this?

cat db.list | while read line ; do esearch -db ${line} -query "WP_003131952.1" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}' ; done

This command is not efficient because if searches all the databases in the list, so I think you can write a script with if and else/elseif conditions to check each db and only continue if not found (for example continue as long as found==0).

ADD REPLY
0
Entering edit mode
4.2 years ago
vkkodali_ncbi ★ 3.7k

Depending on what kind of accessions you have (GenBank, RefSeq or Swiss-Prot), you should be able to tell which database the accession is coming from. Take a look at this page. For example, all RefSeq proteins will have accessions in the format [NAXWY]P_\d+\.\d+. Once you know that, you can filter your input list and use the same EntrezDirect command.

Once you have a filtered list, you can use EntrezDirect as follows:

for acc in `cat accs_list.txt` ; do
    echo -ne "${acc}\t" ;
    elink -db protein -target taxonomy -id ${acc} \
      | efetch -format native -mode xml \
      | xtract -pattern TaxaSet -sep ',' -element ScientificName ; 
done 

## example output
NP_002817     Homo sapiens,cellular organisms,Eukaryota,Opisthokonta,Metazoa,Eumetazoa,...
WP_003131952  Lactococcus,cellular organisms,Bacteria,Terrabacteria group,Firmicutes,...

This will generate a two column table with accession and the lineage in tab-delimited format. However, if you don't particularly care about the accession to lineage map and just want to see a list of lineages for your entire set of accessions then you can use epost as follows:

epost -db protein -format acc -input accs_list.txt \
  | elink -db protein -target taxonomy \
  | efetch -format native -mode xml \
  | xtract -pattern TaxaSet -sep ',' -element ScientificName

## example output
Homo sapiens,cellular organisms,Eukaryota,Opisthokonta,Metazoa,Eumetazoa,...
Lactococcus,cellular organisms,Bacteria,Terrabacteria group,Firmicutes,...
ADD COMMENT
0
Entering edit mode

This is very useful thank you!

ADD REPLY

Login before adding your answer.

Traffic: 2255 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6