Question: Using Entrez to find the taxonomy for an accession number
0
gravatar for Freddy
5 months ago by
Freddy20
Freddy20 wrote:

Hello,

I am using Entrez to find the taxonomy for an accession number. This is how I am looking for it:

esearch -db protein -query "WP_003131952.1" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'

As seen above, to search for the taxonomy of an accession number, I need to first specify the database. My question is that, imagine there is an accession number, and we do not know the database for this accession number. Is there a way to search all the databases on NCBI?

ADD COMMENTlink modified 5 months ago by vkkodali2.0k • written 5 months ago by Freddy20
1

Would it work if you add the possible dbs to a file (db.list) each db in one line and use something like this?

cat db.list | while read line ; do esearch -db ${line} -query "WP_003131952.1" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}' ; done

This command is not efficient because if searches all the databases in the list, so I think you can write a script with if and else/elseif conditions to check each db and only continue if not found (for example continue as long as found==0).

ADD REPLYlink modified 5 months ago • written 5 months ago by Fatima590
0
gravatar for vkkodali
5 months ago by
vkkodali2.0k
United States
vkkodali2.0k wrote:

Depending on what kind of accessions you have (GenBank, RefSeq or Swiss-Prot), you should be able to tell which database the accession is coming from. Take a look at this page. For example, all RefSeq proteins will have accessions in the format [NAXWY]P_\d+\.\d+. Once you know that, you can filter your input list and use the same EntrezDirect command.

Once you have a filtered list, you can use EntrezDirect as follows:

for acc in `cat accs_list.txt` ; do
    echo -ne "${acc}\t" ;
    elink -db protein -target taxonomy -id ${acc} \
      | efetch -format native -mode xml \
      | xtract -pattern TaxaSet -sep ',' -element ScientificName ; 
done 

## example output
NP_002817     Homo sapiens,cellular organisms,Eukaryota,Opisthokonta,Metazoa,Eumetazoa,...
WP_003131952  Lactococcus,cellular organisms,Bacteria,Terrabacteria group,Firmicutes,...

This will generate a two column table with accession and the lineage in tab-delimited format. However, if you don't particularly care about the accession to lineage map and just want to see a list of lineages for your entire set of accessions then you can use epost as follows:

epost -db protein -format acc -input accs_list.txt \
  | elink -db protein -target taxonomy \
  | efetch -format native -mode xml \
  | xtract -pattern TaxaSet -sep ',' -element ScientificName

## example output
Homo sapiens,cellular organisms,Eukaryota,Opisthokonta,Metazoa,Eumetazoa,...
Lactococcus,cellular organisms,Bacteria,Terrabacteria group,Firmicutes,...
ADD COMMENTlink written 5 months ago by vkkodali2.0k

This is very useful thank you!

ADD REPLYlink written 4 months ago by Biostar0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2091 users visited in the last hour