Question: Retrieve species name using taxaIDs of NCBI
chetana (San Diego) wrote 3.6 years ago:

Hi everyone,

I have a long list of taxaIDs that I want to map to scientific names (species) and lineages. I looked at the names.dmp file, which maps taxaIDs to names, and tried to pull out the ones I wanted using Python, but names.dmp has multiple rows for a given taxaID and I only need the scientific name, so I'm not sure how to proceed. I also tried Entrez efetch, but I think it needs XML input, and I just have a .txt file with a list of taxaIDs. I'm quite new to bioinformatics; any help and suggestions are appreciated. Thanks in advance!

modified 3.2 years ago by -_- • written 3.6 years ago by chetana

The file 'names.dmp' has four columns. The first column is the taxid, the second column is a name, and the fourth column is the class of the name. A taxid may have several names assigned to it, but each of these names has a different 'class'. Every taxid has exactly one name of class 'scientific name', while the other classes are optional. Thus you can restrict your search to lines that have 'scientific name' in the fourth column. Please compare the output of these two awk searches:

awk -F '|' '$1==9606' names.dmp

awk -F '|' '$1==9606 && $4~/scientific name/' names.dmp

Unfortunately, 'names.dmp' is a bit nasty to parse because of the abundant and unnecessary white space in it.
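
The same filter can be sketched in Python, which is what the question asked for. This is only a minimal sketch, assuming the usual dump layout in which fields are separated by a tab, a pipe and a tab, and each line ends with a tab and a pipe. The function name parse_scientific_names and the sample lines are made up for illustration.

```python
def parse_scientific_names(lines):
    """Map taxid -> scientific name from names.dmp-style lines."""
    names = {}
    for line in lines:
        # Assumed layout: fields separated by "\t|\t", line terminated by "\t|".
        fields = line.rstrip("\t|\n").split("\t|\t")
        # Keep only the row whose name class (4th field) is 'scientific name'.
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


# Small in-memory sample in the names.dmp layout (not the real file).
sample = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
    "9685\t|\tFelis catus\t|\t\t|\tscientific name\t|\n",
]
names = parse_scientific_names(sample)
for taxid in (9606, 9685):
    print(taxid, names[taxid], sep="\t")
```

With the real dump, the open file object can be passed in place of the sample list, and the IDs from your .txt file looked up in the resulting dictionary.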

written 3.5 years ago by piet
Juan Manuel Berros (Buenos Aires, Argentina) wrote 3.5 years ago:

I'm adding a Python solution that uses Biopython. I feel that although wordier, it is more scalable and readable than the concatenation of pipes. You just need to specify the filename with your tax IDs; here, I've used human and cat IDs as an example:

The output can be dumped to a file and read as a CSV:

Homo sapiens,cellular organisms >  Eukaryota >  Opisthokonta >  Metazoa >  Eumetazoa >  Bilateria >  Deuterostomia >  Chordata >  Craniata >  Vertebrata >  Gnathostomata >  Teleostomi >  Euteleostomi >  Sarcopterygii >  Dipnotetrapodomorpha >  Tetrapoda >  Amniota >  Mammalia >  Theria >  Eutheria >  Boreoeutheria >  Euarchontoglires >  Primates >  Haplorrhini >  Simiiformes >  Catarrhini >  Hominoidea >  Hominidae >  Homininae >  Homo
Felis catus,cellular organisms >  Eukaryota >  Opisthokonta >  Metazoa >  Eumetazoa >  Bilateria >  Deuterostomia >  Chordata >  Craniata >  Vertebrata >  Gnathostomata >  Teleostomi >  Euteleostomi >  Sarcopterygii >  Dipnotetrapodomorpha >  Tetrapoda >  Amniota >  Mammalia >  Theria >  Eutheria >  Boreoeutheria >  Laurasiatheria >  Carnivora >  Feliformia >  Felidae >  Felinae >  Felis
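
For readers who want something runnable, output of that shape could come from a script along the following lines. This is a sketch, not the poster's original code. It assumes Biopython's Bio.Entrez module (Entrez.efetch and Entrez.read) and the "ScientificName" and "Lineage" keys that NCBI taxonomy records carry. The function names and the email argument are illustrative.

```python
def fetch_taxa(taxids, email):
    """Fetch taxonomy records for the given IDs via NCBI E-utilities."""
    # Biopython is imported here so format_row can be used without it installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to supply a contact address
    handle = Entrez.efetch(db="taxonomy", id=",".join(taxids), retmode="xml")
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def format_row(record):
    """Turn one record into 'name,lineage' with ' > ' between lineage levels."""
    levels = [level.strip() for level in record["Lineage"].split(";")]
    return record["ScientificName"] + "," + " > ".join(levels)


def main(path, email):
    # path: a text file with one tax ID per line (name chosen by the caller).
    with open(path) as fh:
        taxids = [line.strip() for line in fh if line.strip()]
    for record in fetch_taxa(taxids, email):
        print(format_row(record))
```

Calling main with the path to your ID list and your email address prints one line per taxon. For a very long list it is probably wise to send the IDs in batches rather than in one request.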

Cheers!

modified 3.5 years ago • written 3.5 years ago by Juan Manuel Berros
shenwei356 (China) wrote 3.5 years ago:

Try TaxonKit (a cross-platform and efficient NCBI taxonomy toolkit) with the lineage subcommand (usage), which queries the full lineage of the given taxids from a file.

TaxonKit is a command-line tool written in the Go programming language; executable binaries for most popular operating systems are freely available on the download page. Just download the compressed executable for your operating system, uncompress it, and run it.

It's very fast!

NCBI taxonomy data is needed: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Example data:

$ cat t.taxid
349741
834

Query lineage:

$ taxonkit lineage --nodes nodes.dmp --names names.dmp  t.taxid
349741  cellular organisms;cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
834     cellular organisms;cellular organisms;Bacteria;FCB group;Fibrobacteres;Fibrobacteria;Fibrobacterales;Fibrobacteraceae;Fibrobacter;Fibrobacter succinogenes;Fibrobacter succinogenes subsp. succinogenes

A QIIME-like format can be obtained with the -f flag:

$ taxonkit lineage --nodes nodes.dmp --names names.dmp -f t.taxid
349741  k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila
834     k__Bacteria;p__Fibrobacteres;c__Fibrobacteria;o__Fibrobacterales;f__Fibrobacteraceae;g__Fibrobacter;s__Fibrobacter succinogenes;S__Fibrobacter succinogenes subsp. succinogenes

You can also extract custom levels of rank with reformat (usage). The default format is {k};{p};{c};{o};{f};{g};{s}:

$ taxonkit lineage --nodes nodes.dmp --names names.dmp t.taxid | cut -f 2 | taxonkit reformat | cut -f 2
Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Bacteria;Fibrobacteres;Fibrobacteria;Fibrobacterales;Fibrobacteraceae;Fibrobacter;Fibrobacter succinogenes
modified 3.5 years ago • written 3.5 years ago by shenwei356
Prasad (India) wrote 3.6 years ago:

efetch does not need XML input. Here is a Linux command-line solution:

for i in $(cat file); do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=$i&rettype=docsum&retmode=text" | head -1 | sed -e 's/1\. //' | awk -v id="$i" '{print id "\t" $0}'; done

where file is a file containing all the taxon IDs, one per line.

modified 3.6 years ago • written 3.6 years ago by Prasad

Thanks for the reply Prasad, it worked. Is there a way I can get full lineage using TaxaIDs? Thank you.

written 3.6 years ago by chetana

Just remove rettype and retmode from the efetch link; it then returns XML, from which you can parse the full lineage.

written 3.6 years ago by Prasad

For example:

curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=9606" | grep -iw lineage | perl -ne '{if(/.*?\>(.*?)\<\/Lineage\>/){print $1,"\n";}}'
written 3.6 years ago by Prasad
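
The grep and perl step can also be done with an XML parser, which is less sensitive to how the response is laid out. Below is a minimal Python sketch using only the standard library. It assumes the response is a TaxaSet of Taxon elements carrying TaxId, ScientificName and Lineage children. The function names and the trimmed sample document are illustrative, not real efetch output.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(taxids):
    """Build an efetch URL for one or more tax IDs (sent comma-separated)."""
    query = urllib.parse.urlencode({"db": "taxonomy", "id": ",".join(taxids)})
    return EFETCH + "?" + query


def parse_lineages(xml_text):
    """Return (taxid, scientific name, lineage) for each top-level Taxon."""
    root = ET.fromstring(xml_text)
    # findall("Taxon") only matches direct children, so the nested Taxon
    # elements inside LineageEx are not reported as separate rows.
    return [
        (t.findtext("TaxId"), t.findtext("ScientificName"), t.findtext("Lineage"))
        for t in root.findall("Taxon")
    ]


def fetch_lineages(taxids):
    # Network call; not executed in the demo below.
    with urllib.request.urlopen(efetch_url(taxids)) as response:
        return parse_lineages(response.read())


# Trimmed, hand-written sample in the assumed shape of an efetch response.
sample = """<TaxaSet><Taxon>
  <TaxId>9606</TaxId>
  <ScientificName>Homo sapiens</ScientificName>
  <Lineage>cellular organisms; Eukaryota; Homo</Lineage>
  <LineageEx><Taxon><TaxId>131567</TaxId></Taxon></LineageEx>
</Taxon></TaxaSet>"""
rows = parse_lineages(sample)
print(rows)
```

fetch_lineages takes a list of ID strings read from your file and returns one tuple per taxon.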
-_- (Canada) wrote 3.2 years ago:

I converted the whole taxdump into a CSV file of lineages, each identified by a tax ID: https://github.com/zyxue/ncbitax2lin. You may find it helpful.
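
To illustrate how a lineage can be derived from the dump files at all, here is a minimal Python sketch that walks parent pointers from nodes.dmp. It is not the ncbitax2lin implementation. It assumes the first two nodes.dmp fields are the tax ID and the parent tax ID, and that the root (tax ID 1) is its own parent. The function names and the toy data are made up.

```python
def parse_parents(lines):
    """Map taxid -> parent taxid from nodes.dmp-style lines."""
    parents = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\t|\n").split("\t|\t")
        parents[int(fields[0])] = int(fields[1])
    return parents


def lineage(taxid, parents, names):
    """Walk parent pointers up to the root and return names from root to taxid."""
    path = []
    while True:
        # Fall back to the numeric ID if no name is known for this node.
        path.append(names.get(taxid, str(taxid)))
        parent = parents[taxid]
        if parent == taxid:  # the root points to itself
            break
        taxid = parent
    return list(reversed(path))


# Toy data in the nodes.dmp layout (only the first three fields shown).
nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "131567\t|\t1\t|\tno rank\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\n",
]
names = {1: "root", 131567: "cellular organisms", 2: "Bacteria"}
result = lineage(2, parse_parents(nodes), names)
print(";".join(result))
```

The names dictionary would normally be built from the 'scientific name' rows of names.dmp.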

written 3.2 years ago by -_-