Question

Postgresql: Select Only Taxa Which Are Children Of A Given Taxon Using Parent Taxon Id

0

Entering edit mode

12.0 years ago

Tgh ▴ 10

Hi everyone,

I am stuck with the following problem: From a 'BLAST result' table in a database I want to select only those entries which fulfill some specific requirements (e.g. contain a keyword) and are from species that are 'children' of a given taxon 'XY' with the id 123.
E.g., I want all entries which belong to species which themselves belong to the superkingdom Bacteria. The query that takes into account the specific keyword, etc. works fine.
However, I find it very hard to get a good idea how to write the query so that it only takes the 'children' of a taxon.

The table 'taxonomy' was created from the NCBI taxonomy names.dmp and nodes.dmp files. The taxon_id is the NCBI taxonomy ID, while the other IDs are from the database only. It looks like this:

   id    | taxon_id | parent_taxon_id | taxon_name |     rank     
---------+----------+-----------------+------------+--------------
 3963792 |        2 |          131567 | Bacteria   | superkingdom

The 'BLAST result' table contains the foreign key 'taxonomy_id' of the taxonomy table. It looks like this (I removed some columns with more information on the actual BLAST hit):

   id   | score | tax_id |     blast_spec     | taxonomy_id
--------+-------+--------+--------------------+-------------
 850859 |   722 |   9483 | Callithrix jacchus |     3981316
 850860 |   726 |  10141 | Cavia porcellus    |     3982450

My idea would be, to make a FOR loop and iterate over all results that I get and check, whether the 'taxonomy_id' in there has a parent_taxon_id that has the 123 as its parent_taxon_id.

However, I have absolutely no clue how to design such a loop in SQL and how to make sure it does not run forever. I have not yet understood the concept of joins, but these probably would also help in this particular problem?

This is probably a problem that a person with more experience in SQL than me can easily handle..?

I am very thankful for all your advice!

ncbi taxonomy • 5.3k views

ADD COMMENT • link updated 10.1 years ago by Biostar 20 • written 12.0 years ago by Tgh ▴ 10

0

Entering edit mode

Can you add the table 'BLAST result', and what the foreign keys are?

ADD REPLY • link 12.0 years ago by Niek De Klein ★ 2.6k

0

Entering edit mode

For 'BLAST result' it says: Foreign-key constraints: "BLASTresult_taxonomy_id_fkey" FOREIGN KEY (taxonomy_id) REFERENCES taxonomy(id)

ADD REPLY • link 12.0 years ago by Tgh ▴ 10

0

Entering edit mode

Why does the blast result table have both a tax_id and a taxonomy_id column?

Ah, taxonomy_id refers to taxonomy(id) while tax_id refers to taxonomy(taxon_id). Having redundancy like this is not the best idea.

ADD REPLY • link 12.0 years ago by asjo ▴ 120

score 1 · Answer 1 · 2012-05-04

One solution is to use the connectby() function from the tablefunc contrib extension. In PostgreSQL 9.1+ you install it simply by going CREATE EXTENSION tablefunc;

With a little example taxonomy tree looking like this:

taxonomy=# SELECT * FROM taxonomy ORDER BY id;
 id | taxon_id | parent_taxon_id |        taxon_name        |  rank   
----+----------+-----------------+--------------------------+---------
  1 |        1 |               1 | root                     | no rank
  2 |        2 |               1 | Bacteria                 | no rank
  3 |        3 |               1 | Eukaryota                | no rank
  4 |        4 |               1 | Archaea                  | no rank
  5 |        5 |              11 | Callithrix jacchus       | species
  6 |        6 |              11 | Cavia porcellus          | species
  7 |        7 |              10 | Bacillus subtilis        | species
  8 |        8 |               9 | Saccharomyces cerevisiae | species
  9 |        9 |               3 | Fungi                    | kingdom
 10 |       10 |               2 | Bacillus                 | genus
 11 |       11 |               3 | Mammalia                 | class
 12 |       12 |               4 | Pyrobaculum              | genus
 13 |       13 |              12 | Pyrobaculum neutrophilum | species
(13 rows)

You can use connectby() like this, to get all the nodes under, say, Bacteria:

taxonomy=# SELECT * FROM connectby('taxonomy', 'taxon_id', 'parent_taxon_id', 'taxon_id', '2', 0, '~') AS t(keyid text, parent_keyid text, level int, branch text, pos int);
 keyid | parent_keyid | level | branch | pos 
-------+--------------+-------+--------+-----
 2     |              |     0 | 2      |   1
 10    | 2            |     1 | 2~10   |   2
 7     | 10           |     2 | 2~10~7 |   3
(3 rows)

Ok, we are ready to combine this with your results - just JOIN a connectby() query returning all records from the taxonomy table located under Bacteria:

Now, with a sample result table looking like this:

taxonomy=# SELECT * FROM result;
 id | score | tax_id |        blast_spec        | taxonomy_id 
----+-------+--------+--------------------------+-------------
  1 |   722 |      5 | Callithrix jacchus       |           5
  2 |   726 |      6 | Cavia porcellus          |           6
  3 |   756 |      7 | Bacillus subtilis        |           7
  4 |   781 |      8 | Saccharomyces cerevisiae |           8
(4 rows)

We can get just the results under Bacteria like this:

taxonomy=# SELECT * from result JOIN (SELECT keyid::int FROM (SELECT * FROM connectby('taxonomy', 'taxon_id', 'parent_taxon_id', 'taxon_id', '2', 0, '~') AS t(keyid text, parent_keyid text, level int, branch text, pos int)) AS bacteria) AS b ON taxonomy_id=keyid ORDER BY score;
 id | score | tax_id |    blast_spec     | taxonomy_id | keyid 
----+-------+--------+-------------------+-------------+-------
  3 |   756 |      7 | Bacillus subtilis |           7 |     7
(1 row)

If we want all the Eukaryote results, we join the connectby() starting at taxon_id 3 instead:

taxonomy=# SELECT * from result JOIN (SELECT keyid::int FROM (SELECT * FROM connectby('taxonomy', 'taxon_id', 'parent_taxon_id', 'taxon_id', '3', 0, '~') AS t(keyid text, parent_keyid text, level int, branch text, pos int)) AS eukaryota) AS e ON taxonomy_id=keyid ORDER BY score;
 id | score | tax_id |        blast_spec        | taxonomy_id | keyid 
----+-------+--------+--------------------------+-------------+-------
  1 |   722 |      5 | Callithrix jacchus       |           5 |     5
  2 |   726 |      6 | Cavia porcellus          |           6 |     6
  4 |   781 |      8 | Saccharomyces cerevisiae |           8 |     8
(3 rows)

And finally if we just want the mammals, we can start from taxon_id 11 in the tree:

taxonomy=# SELECT * from result JOIN (SELECT keyid::int FROM (SELECT * FROM connectby('taxonomy', 'taxon_id', 'parent_taxon_id', 'taxon_id', '11', 0, '~') AS t(keyid text, parent_keyid text, level int, branch text, pos int)) AS mammals) AS m ON taxonomy_id=keyid ORDER BY score;
 id | score | tax_id |     blast_spec     | taxonomy_id | keyid 
----+-------+--------+--------------------+-------------+-------
  1 |   722 |      5 | Callithrix jacchus |           5 |     5
  2 |   726 |      6 | Cavia porcellus    |           6 |     6
(2 rows)

You can also use common table expressions to traverse the tree - I have put that in an alternative answer as Biostars only allows 5000 characters in an answer :-)

score 1 · Answer 2 · 2012-05-04

Using the same example data as in my answer using connectby(), here is a solution using commont table expressions instead.

All the results under Bacteria:

taxonomy=# SELECT * FROM result JOIN (WITH RECURSIVE relevant_taxonomy AS (SELECT taxon_id FROM taxonomy WHERE taxon_id=2 UNION ALL SELECT taxonomy.taxon_id FROM taxonomy JOIN relevant_taxonomy ON relevant_taxonomy.taxon_id=taxonomy.parent_taxon_id) SELECT taxon_id FROM relevant_taxonomy) AS subtree ON subtree.taxon_id=result.tax_id;
 id | score | tax_id |    blast_spec     | taxonomy_id | taxon_id 
----+-------+--------+-------------------+-------------+----------
  3 |   756 |      7 | Bacillus subtilis |           7 |        7
(1 row)

All the Eukaryote results:

taxonomy=# SELECT * FROM result JOIN (WITH RECURSIVE relevant_taxonomy AS (SELECT taxon_id FROM taxonomy WHERE taxon_id=3 UNION ALL SELECT taxonomy.taxon_id FROM taxonomy JOIN relevant_taxonomy ON relevant_taxonomy.taxon_id=taxonomy.parent_taxon_id) SELECT taxon_id FROM relevant_taxonomy) AS subtree ON subtree.taxon_id=result.tax_id;
 id | score | tax_id |        blast_spec        | taxonomy_id | taxon_id 
----+-------+--------+--------------------------+-------------+----------
  1 |   722 |      5 | Callithrix jacchus       |           5 |        5
  2 |   726 |      6 | Cavia porcellus          |           6 |        6
  4 |   781 |      8 | Saccharomyces cerevisiae |           8 |        8
(3 rows)

And finally just the mammal results:

taxonomy=# SELECT * FROM result JOIN (WITH RECURSIVE relevant_taxonomy AS (SELECT taxon_id FROM taxonomy WHERE taxon_id=11 UNION ALL SELECT taxonomy.taxon_id FROM taxonomy JOIN relevant_taxonomy ON relevant_taxonomy.taxon_id=taxonomy.parent_taxon_id) SELECT taxon_id FROM relevant_taxonomy) AS subtree ON subtree.taxon_id=result.tax_id;
 id | score | tax_id |     blast_spec     | taxonomy_id | taxon_id 
----+-------+--------+--------------------+-------------+----------
  1 |   722 |      5 | Callithrix jacchus |           5 |        5
  2 |   726 |      6 | Cavia porcellus    |           6 |        6
(2 rows)

I think the CTE - this - solution is more elegant. I would expect it to be more efficient as well, but my test-set is way to small to say.