From accession ID to phylum taxid
Entering edit mode
2.3 years ago

Hello guys,

I have a list of 10 000 accessions id from blast in a txt file (XP_002184977.1 GBG35237.1) and I would like to have the taxonomy associated (in a tab file that correpond to my accessions line per line). So I'm using Entrez direct like this :

esearch -db protein -query "XP_002184977.1" \
  | elink -target taxonomy \
  | efetch -format native -mode xml \
  | xtract -pattern Taxon -block "*/Taxon" -unless Rank -equals "no rank" -tab "\n" -element Rank,TaxId,ScientificName

As a result I have different lines for each taxonomy level

superkingdom    2759    Eukaryota
clade   2698737 Sar
clade   33634   Stramenopiles
clade   2696291 Ochrophyta
phylum  2836    Bacillariophyta
class   33849   Bacillariophyceae
clade   33850   Bacillariophycidae
order   38748   Naviculales
family  38749   Phaeodactylaceae
genus   2849    Phaeodactylum
species 2850    Phaeodactylum tricornutum

However the output is not always consistant according to the hit accession (sometimes there is no line phylum which is the line that interest me). So if I had a "grep phylum" I can't concatenaate my hit accession with the results...

Any ideas on how to deal with that? It would be great if i could have the whole taxonomy with "N.A" in the column phylum if the information is not present in the database. I have tried some other tools like taxonomisr, unsuccessfully ...

RNA-Seq • 1.1k views
Entering edit mode

There is going to be no good solution for Entrezdirect. If that information is not there then it is not going to show up. I would have said get the tax dump file from NCBI but since EntrezDirect is looking at that same database the result will not be different.

Entering edit mode

Thanks for your advice, you are right. Then the best for me would be to have the full information like for each lvl, the result (for instance "human" fo species or "Non assigned" if not found) but I don't know how to do that.

Or a less ambitious way would be to get the phylum with "grep phylum", and if it does not find anything, add a "0" for instance so it stills add a line in my file (how to do that though..), so I can match my results with my accession file.

By the way is there a way to do this search with dowloading the database to gain some time ? Because 10000 accessions seem too much for edirect and I have to split my files.

Entering edit mode

The fastest way would be to download the base files from NCBI and parse the information you need from them. This is the approach taken by several tools that need to gather taxonomic information from NCBI accessions, such as BlobTools or KronaTools.

Entering edit mode

Ok thanks ! I'm open to new tools too. i have checked Blobtools doc ( But I haven't use it before and I don't really get exactly how to use my data with it

Entering edit mode

You can find accession2taxid files in this folder at NCBI's FTP site. You could get that file and then parse locally. I don't recall if that file will give you all the intermediate levels.

Entering edit mode
2.3 years ago

Some species indeed have no phylum taxon level, so you can not retrieve the phylum taxid. e.g.,

echo 314101 \
    | taxonkit lineage \
    | taxonkit reformat \
    | csvtk -H -t cut -f 1,3 \
    | csvtk -H -t sep -f 2 -s ';' -R \
    | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \
    | csvtk pretty -t 
taxid    kindom     phylum   class   order   family   genus   species
314101   Bacteria                                             uncultured murine large bowel bacterium BAC 54B

But you can assign it a pseudo taxon name in downstream analysis.

Here's a solution with my taxonkit (v0.6.1rc1 or later versions). prot.accession2taxid.gz is needed. csvtk and taxonkit are available on bioconda.

cat acc.txt

# 1) get taxid for all accession
# 7min
time pigz -dc prot.accession2taxid.gz \
    | csvtk grep -t -f accession.version -P acc.txt \
    | csvtk cut -t -f accession.version,taxid \
    | sed 1d \
    > acc2taxid.txt

cat acc2taxid.txt
XP_002184977.1  556484
AAX16399.1  314101

# 2) get taxonomy
# 6s
time cat acc2taxid.txt \
    | taxonkit lineage -i 2 \
    | taxonkit reformat -f '{p}' -t --fill-miss-rank -i 3 \
    | csvtk cut -t -f 1,2,4,5 \
    | csvtk add-header -t -n accession,taxid,phylum,phylum-taxid \
    | csvtk pretty -t 
accession        taxid    phylum                         phylum-taxid
XP_002184977.1   556484   Bacillariophyta                2836
AAX16399.1       314101   unclassified Bacteria phylum

There's a bug here, it supposed to be unclassified Bacteria phylum (update: fixed, use v0.6.1 or later versions).

--fill-miss-rank in taxonkit reformat is for fillling missing rank with higher taxon.

Entering edit mode

The bug is fixed. Use this patch v0.6.1rc1


Login before adding your answer.

Traffic: 1459 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6