Extracting taxonomic information for species from NCBI using command line
0
0
Entering edit mode
10 months ago
Mani • 0

Hi,

I have a file containing scientific names for a number of species (one per line) and want to extract major taxonomic information for these species ( class, order, family) using command line and write it in an output tab separated file. For example my input file is:

cat input.txt

Danio rerio
Carassius gibelio
Mugil cephalus

and I want the output to be :

cat output.txt

Danio rerio     Actinopterygii  Cypriniformes   Cyprinidae
Carassius gibelio   Actinopterygii  Cypriniformes   Cyprinidae
Mugil cephalus  Actinopterygii  Mugiliformes    Mugilidae

I have a python script for this, but it does not extract the correct taxonomy for all queries. Here is my script:

cat myscript.py

import sys
from Bio import Entrez

def get_taxonomic_info(species_name):
    try:
        Entrez.email = "xxxx"  # Add your email here

        # Search for the species in the Taxonomy database
        search_handle = Entrez.esearch(db="taxonomy", term=species_name, retmode="xml")
        search_results = Entrez.read(search_handle)

        # Check if any results are found
        if search_results["Count"] == "0":
            print(f"Error: {species_name} not found in the NCBI Taxonomy database")
            return "N/A", "N/A", "N/A"

        # Retrieve taxonomic information using the first result
        tax_id = search_results["IdList"][0]
        tax_handle = Entrez.efetch(db="taxonomy", id=tax_id, retmode="xml")
        tax_record = Entrez.read(tax_handle)

        # Extract class, order, and family from the taxonomic information
        lineage = tax_record[0].get("LineageEx", [])

        class_info = next((entry["ScientificName"] for entry in lineage if entry["Rank"] == "class"), "N/A")
        order_info = next((entry["ScientificName"] for entry in lineage if entry["Rank"] == "order"), "N/A")
        family_info = next((entry["ScientificName"] for entry in lineage if entry["Rank"] == "family"), "N/A")

        return class_info, order_info, family_info

    except Exception as e:
        print(f"Error: An unexpected error occurred for {species_name}")
        print(f"Details: {e}")
        return "N/A", "N/A", "N/A"

def process_input(input_file, output_file):
    with open(input_file, "r") as infile, open(output_file, "w") as outfile:
        for line in infile:
            species_name = line.strip()
            class_info, order_info, family_info = get_taxonomic_info(species_name)
            outfile.write(f"{species_name}\t{class_info}\t{order_info}\t{family_info}\n")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python3 add_taxonomic_info.py input.txt output.txt")
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = sys.argv[2]

    process_input(input_file, output_file)

when I run it:

python3 input.txt output.txt

and here is the output:

cat output.txt

Danio rerio Actinopteri Cypriniformes   Danionidae
Carassius gibelio   Actinopteri Cypriniformes   Cyprinidae
Mugil cephalus  Actinopteri Mugiliformes    Mugilidae

Any help would be appreciated!

Cheers,

Mani

NCBI taxid taxonomy • 675 views
ADD COMMENT
1
Entering edit mode

Posting as an additional command line option. Using Entrezdirect one can get:

$ esearch -db taxonomy -query "Mugil cephalus" | efetch -format native -mode xml | xtract -pattern Taxon -block "*/Taxon" -unless Rank -equals "no rank" -tab "\n" -element Rank,ScientificName | grep -e "superclass" -e "order" -e "family" | sort -usr

superclass  Actinopterygii
order   Mugiliformes
family  Mugilidae
ADD REPLY
0
Entering edit mode

but it does not extract the correct taxonomy for all queries

That may be par for course. Not every entrez query returns complete information from taxonomy database. You are going to get information where it exists in the database.

BTW the output you are showing above seems to match what you want.

ADD REPLY
0
Entering edit mode

As noted by GenoMax , the outcome depends on the available information in the database, so some queries may yield missing data. Having said this, I have a small piece of code in R that can work here.

For your example :

# Dependencies
library(XML)
library(rentrez)

# Define a vector with species name 
orgs = c("Danio rerio", "Carassius gibelio", "Mugil cephalus")

# Get taxonomy ids for each species
taxids <- c()
for(org in orgs) {
foo = entrez_search(db = "taxonomy", term = paste0(org,"[SCIN]"))
taxids <- c(taxids, foo$ids)
}

# Programatic access to ENTREZ : xml file
source("https://raw.githubusercontent.com/jdieramon/my_scripts/master/byRequest/functions.R")

vals <- get_taxonomy(taxids)
vals

Output :

> vals
# A tibble: 3 × 6
Spp               superkingdom phylum   class       order         family    
<chr>             <chr>        <chr>    <chr>       <chr>         <chr>     
1 Danio rerio       Eukaryota    Chordata Actinopteri Cypriniformes Danionidae
2 Carassius gibelio Eukaryota    Chordata Actinopteri Cypriniformes Cyprinidae
3 Mugil cephalus    Eukaryota    Chordata Actinopteri Mugiliformes  Mugilidae 
ADD REPLY

Login before adding your answer.

Traffic: 1228 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6