Tutorial: Getting the lineage of an organism from NCBI taxonomy data

6.9 years ago

vimalkvn ▴ 320

This is a method I followed to get an organism's complete lineage from NCBI taxonomy data. The procedure involves:

Creating a MySQL database.
Initialising the database with the BioSQL schema.
Downloading and importing the NCBI taxonomy data into the database.
Using a Python script to query the database and return an organism's complete lineage.

Much of the information in this post comes from the documentation of the BioSQL and Biopython projects. In addition, I wrote a Python script to query the database and output results.

Here is how the script works:

python3 lineager.py -n Bos taurus

Output:

2018-07-14 04:47 INFO     Processing organism name provided
                          at the command line: Bos taurus
Organism,Lineage
Bos taurus,cellular organisms;Eukaryota;Opisthokonta;Metazoa;
Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;
Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;
Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;
Eutheria;Boreoeutheria;Laurasiatheria;Cetartiodactyla;
Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus

Instead of a name, an input file containing names of organisms can be provided using the -f option.

Requirements

I'm doing this on BioVM (a virtual machine based on Ubuntu Linux 16.04) but these steps can be performed on any system where the following programs can be installed. Versions I used are in brackets:

MySQL (5.7.22) for the database.
Perl (5.22.1), including the DBI module and the DBD module for MySQL, for initialising the database with the BioSQL schema and for the load_ncbi_taxonomy.pl script.
Python (3.5.2) and the MySQLdb module (1.3.7) for the lineager.py script.

All these steps are performed as a regular user (here, biovm). sudo privileges are required for installing packages, creating the database and creating the database user.

Install the MySQL database server:

sudo apt install -y mysql-server

During the installation, there will be a request to set a password for the mysql admin user (-u root in mysql commands). You will need this password to connect and create the database later.

After installation, you can check if the MySQL server is up and running using:

sudo netstat -tlpn | grep mysql
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      3113/mysqld

Install the Perl DBI and DBD modules and the Python MySQL client libraries required for interacting with the MySQL database:

sudo apt install -y libdbi-perl libdbd-mysql-perl \
libmysqlclient-dev python3-mysqldb

Create the biosql database

Login as the mysql admin user:

mysql -u root -p

Create the database. Note the use of back ticks ` instead of single quotes ' for the database name:

create database `biosql` collate 'utf8_general_ci';

Create the database user and grant it permissions on the biosql database. Note: set a strong password in the identified by clause:

create user `biovm`@`localhost` identified by 'your-password-here';
grant all privileges on `biosql`.* to `biovm`@`localhost`;

Quit the mysql shell by typing exit.

For the database connection to work in the lineager.py script, create a file .my.cnf in the user's HOME directory with the following content:

[client]
user = biovm
password = your-password-here
database = biosql
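
Before running anything against the database, it can help to confirm that the option file parses and contains the expected keys. The following is a minimal sketch using only the Python standard library; the function name parse_client_options is hypothetical and not part of lineager.py. MySQL option files are close to INI syntax but not identical (they may contain bare flags and include directives), so treat this as a sanity check rather than a full parser.

```python
import configparser


def parse_client_options(text):
    """Return the [client] section of MySQL option file text as a dict.

    Hypothetical helper for illustration; not taken from lineager.py.
    """
    # allow_no_value: option files may contain bare flags without "=".
    # interpolation=None: a "%" in a password must be kept literally.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(text)
    if not parser.has_section("client"):
        raise ValueError("option file has no [client] section")
    return dict(parser["client"])


def missing_keys(options, required=("user", "password", "database")):
    """Return the required keys that are absent from the parsed options."""
    return [key for key in required if key not in options]
```

To check the real file, read it with open(os.path.expanduser("~/.my.cnf")) and pass its contents to parse_client_options. For the connection itself, MySQLdb.connect accepts a read_default_file argument, which is one way a script can pick up these settings without the password appearing in the code.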

Download the latest release of the BioSQL schema:

wget -c https://github.com/biosql/biosql/archive/biosql-release-1-0-1.tar.gz

Decompress the archive and change into the extracted directory:

tar zxvf biosql-release-1-0-1.tar.gz
cd biosql-biosql-release-1-0-1

Initialise the database with the BioSQL schema

Edit the file biosqldb-mysql.sql in the sql directory and remove TYPE=INNODB from all CREATE TABLE statements. See the footnote for the reason.
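
If you would rather not edit the file by hand, the removal can be scripted. This is a rough sketch, assuming the option always appears as TYPE=INNODB (in any letter case, with optional spaces around the equals sign); the function names are hypothetical. It keeps a .bak copy so the original schema file can be restored.

```python
import re

# The table option plus any whitespace in front of it, in any letter case.
TYPE_INNODB = re.compile(r"\s*\bTYPE\s*=\s*INNODB\b", re.IGNORECASE)


def strip_type_innodb(sql_text):
    """Return sql_text with every TYPE=INNODB table option removed."""
    return TYPE_INNODB.sub("", sql_text)


def patch_schema(path):
    """Rewrite the schema file in place, saving the original as path + '.bak'."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    with open(path + ".bak", "w", encoding="utf-8") as handle:
        handle.write(original)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(strip_type_innodb(original))
```

From the biosql-biosql-release-1-0-1 directory, calling patch_schema("sql/biosqldb-mysql.sql") would apply the change; comparing the file with its .bak copy afterwards shows exactly what was removed.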

Initialise the database by executing the SQL script:

mysql -u biovm -D biosql < sql/biosqldb-mysql.sql

Import NCBI taxonomy data

While still in the biosql-biosql-release-1-0-1 directory, create a directory to store the taxonomy data:

mkdir taxdata

Download the taxonomy database, verify integrity of the downloaded archive and then uncompress the archive:

cd taxdata
wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.md5

md5sum -c taxdump.tar.gz.md5
# this should print: taxdump.tar.gz: OK

# uncompress archive
tar zxvf taxdump.tar.gz

# Move to parent directory
cd ..
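
On systems without md5sum, the same integrity check can be done from Python. A minimal sketch, assuming the .md5 file uses the usual md5sum layout (digest, whitespace, file name); the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a large archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_line):
    """Extract the digest from a line in md5sum format: '<digest>  <name>'."""
    return md5_line.split()[0].lower()


def archive_is_intact(archive_path, md5_path):
    """Compare the archive's digest with the one recorded in the .md5 file."""
    with open(md5_path, encoding="ascii") as handle:
        return md5_of_file(archive_path) == expected_md5(handle.readline())
```

Run from inside the taxdata directory, archive_is_intact("taxdump.tar.gz", "taxdump.tar.gz.md5") should return True for a complete download.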

Import taxonomy data into the biosql database. For the script to connect to the database, the DBI_PASSWORD environment variable will need to be set. The password is the one set for the database user earlier:

export DBI_PASSWORD='your-password-here'

Now load the data:

perl scripts/load_ncbi_taxonomy.pl --dbname biosql --dbuser biovm

This will take some time to complete:

Loading NCBI taxon database in taxdata:
... retrieving all taxon nodes in the database
... reading in taxon nodes from nodes.dmp
... insert / update / delete taxon nodes
... updating new parent IDs
... (committing nodes)
... rebuilding nested set left/right values
... reading in taxon names from names.dmp
... deleting old taxon names
... inserting new taxon names
... cleaning up
Done.
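
The "nested set left/right values" step is what makes lineage lookups cheap: every ancestor of a node has a left value at or below the node's and a right value at or above it, so the whole lineage comes back from one query with no recursion. The sketch below illustrates the idea. It assumes the BioSQL taxon and taxon_name tables, reduced to the columns the query needs, and uses an in-memory SQLite database with a five-node toy tree (most ranks omitted) so that it runs without a MySQL server. It is not taken from lineager.py, which may query the database differently.

```python
import sqlite3

# Ancestors of a node are the rows whose (left, right) interval encloses
# the node's own interval; ordering by left_value lists them root first.
LINEAGE_SQL = """
SELECT ancestor_name.name
FROM taxon_name AS target_name
JOIN taxon AS target ON target.taxon_id = target_name.taxon_id
JOIN taxon AS ancestor
  ON ancestor.left_value <= target.left_value
 AND ancestor.right_value >= target.right_value
JOIN taxon_name AS ancestor_name
  ON ancestor_name.taxon_id = ancestor.taxon_id
 AND ancestor_name.name_class = 'scientific name'
WHERE target_name.name = ?
  AND target_name.name_class = 'scientific name'
ORDER BY ancestor.left_value
"""

# Toy data for illustration only: (taxon_id, left_value, right_value, name).
TOY_TAXA = [
    (1, 1, 10, "root"),
    (2, 2, 9, "cellular organisms"),
    (3, 3, 6, "Eukaryota"),
    (4, 4, 5, "Bos taurus"),
    (5, 7, 8, "Bacteria"),
]


def build_toy_database():
    """Create an in-memory database holding the toy taxonomy."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY,"
        " left_value INTEGER, right_value INTEGER)"
    )
    conn.execute(
        "CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT)"
    )
    for taxon_id, left, right, name in TOY_TAXA:
        conn.execute("INSERT INTO taxon VALUES (?, ?, ?)", (taxon_id, left, right))
        conn.execute(
            "INSERT INTO taxon_name VALUES (?, ?, 'scientific name')",
            (taxon_id, name),
        )
    return conn


def lineage(conn, organism):
    """Return scientific names from the root down to organism (empty if unknown)."""
    rows = conn.execute(LINEAGE_SQL, (organism,)).fetchall()
    return [row[0] for row in rows]
```

With the toy data, lineage(build_toy_database(), "Bos taurus") returns the four names from root down to Bos taurus. Against the MySQL database through MySQLdb, the placeholder in the query would be %s rather than ?.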

To view all the options the script supports, install the perl-doc package and then use the --help option:

perl scripts/load_ncbi_taxonomy.pl --help

Run lineager.py to get lineage

Clone the git repository:

cd
git clone --depth 1 https://gitlab.com/vimalkvn/lineager

Run script:

cd lineager
python3 lineager.py -n Escherichia coli

If you have a file input.txt containing names like this:

Canis lupus familiaris
Bos taurus
Escherichia
AMBIGUOUS
Arabidopsis thaliana

then run the script like this:

python3 lineager.py -f input.txt

At the end of the run, an output file lineage.csv will be generated in the same directory.
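
The file-based mode boils down to reading one name per line and writing one CSV row per name, in the Organism,Lineage layout shown in the output at the top of this post. Here is a rough sketch of that loop. The function names and the lookup callable are hypothetical, and writing an empty lineage for a name that cannot be resolved is an assumption; lineager.py may handle such names differently.

```python
import csv


def read_names(lines):
    """Yield organism names from an iterable of lines, skipping blank lines."""
    for line in lines:
        name = line.strip()
        if name:
            yield name


def write_lineage_csv(names, lookup, out_handle):
    """Write one Organism,Lineage row per name.

    lookup is any callable that maps a name to a list of taxa (root first);
    an empty list produces an empty Lineage field (an assumption).
    """
    writer = csv.writer(out_handle, lineterminator="\n")
    writer.writerow(["Organism", "Lineage"])
    for name in names:
        writer.writerow([name, ";".join(lookup(name))])
```

Because the taxa are joined with semicolons and the CSV delimiter is a comma, the lineage stays in a single column when the file is opened in a spreadsheet.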

Footnotes

If TYPE=INNODB is not removed from the CREATE TABLE statements, the following error occurs when attempting to initialise the database:

ERROR 1064 (42000) at line 49: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'TYPE=INNODB' at line 8.

This happens because the TYPE table option was removed in MySQL 5.5; the supported syntax is ENGINE=INNODB. As InnoDB is the default storage engine, it is not necessary to specify it in the CREATE TABLE statements.

This looks great! I was looking for such functionality. I am surprised this cannot be done by using a function from BioSQL.py. Is it possible to modify the code to be compatible with import mysql.connector?

4.6 years ago by biofysikos