Automatic Data Extraction From Timetree
11.0 years ago
Biojl ★ 1.7k

Does anyone know how to programmatically extract information from http://timetree.org/?

I have to build a 40x40 matrix of species divergence times, and my wrist is starting to hurt because I have to look up all the pairwise combinations manually.

UPDATE: The provided solutions have stopped working.

evolution tree • 4.0k views

Any chance you have or know of a new solution to this problem? Would love to get some of the data off the site.

No, sorry. I stopped using timetree.org, since without permission to extract data automatically it is of little use in science. It's just a curiosity to show friends on the phone.
You can give DateLife.org a try (see the last response). It didn't work for me, and I don't know whether it's still in development. Test it and report your results!

11.0 years ago

Say you have a text file containing a list of organisms:

$ cat input.txt
Homo Sapiens
Drosophila melanogaster
Canis lupus familiaris
Escherichia coli

The following bash script sends requests with curl and extracts the distance with xmllint/XPath:

#!/bin/bash
# Read whole lines; spaces become "+" so the names can go straight into the URL.
IFS=$'\n'
tr " " "+" < input.txt | while read -r O1
do
    tr " " "+" < input.txt | while read -r O2
    do
        # Query each unordered pair only once.
        if [[ "${O1}" < "${O2}" ]]
        then
            # Fetch the result page, pull the date out with XPath, then fill in the SQL template.
            curl -s "http://timetree.org/index.php?taxon_a=${O1}&taxon_b=${O2}&submit=Search" |
            xmllint --html --format --xpath 'concat("insert into SPECIES(org1,org2,dist) values (__QUOTE____A____QUOTE__,__QUOTE____B____QUOTE__,__QUOTE__",normalize-space(//span[@class="panel year block"][h1]),"__QUOTE__);#")' - 2> /dev/null |
            tr "#" "\n" |
            sed -e "s/__A__/${O1}/g" -e "s/__B__/${O2}/g" -e "s/__QUOTE__/'/g" |
            tr "+" " "
        fi
    done
done

Result:

~$ bash organisms.sh 
insert into SPECIES(org1,org2,dist) values ('Drosophila melanogaster','Homo Sapiens','782.7 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Drosophila melanogaster','Escherichia coli','2535.8 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Homo Sapiens','94.2 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Drosophila melanogaster','782.7 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Escherichia coli','2535.8 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Escherichia coli','Homo Sapiens','2535.8 Million Years Ago');
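
Since the original goal was a pairwise matrix, here is a minimal Python sketch of one way to turn output lines like the ones above into a symmetric table. The names `ROW` and `build_matrix` are made up for illustration, and the regular expression assumes the exact "N Million Years Ago" wording shown in the result; pairs that are not present in the input are left as None.

```python
import re

# Matches one output line of the script, e.g.
# insert into SPECIES(org1,org2,dist) values ('A','B','94.2 Million Years Ago');
# Assumption: the distance always has the form "<number> Million Years Ago".
ROW = re.compile(r"values \('([^']*)','([^']*)','([\d.]+) Million Years Ago'\)")


def build_matrix(lines):
    """Return (species, matrix); matrix[i][j] is the divergence time in
    millions of years, 0.0 on the diagonal, None for pairs not seen."""
    pairs = {}
    for line in lines:
        m = ROW.search(line)
        if m:
            pairs[(m.group(1), m.group(2))] = float(m.group(3))

    species = sorted({name for pair in pairs for name in pair})
    index = {name: i for i, name in enumerate(species)}
    n = len(species)

    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 0.0
    for (a, b), mya in pairs.items():
        # Fill both halves so the matrix is symmetric.
        matrix[index[a]][index[b]] = mya
        matrix[index[b]][index[a]] = mya
    return species, matrix


if __name__ == "__main__":
    sample = [
        "insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Homo Sapiens','94.2 Million Years Ago');",
        "insert into SPECIES(org1,org2,dist) values ('Drosophila melanogaster','Homo Sapiens','782.7 Million Years Ago');",
    ]
    names, table = build_matrix(sample)
    for name, row in zip(names, table):
        print(name, row)
```

Keeping missing pairs as None rather than 0 makes it obvious which lookups failed, which matters if some requests return nothing.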

That's awesome! Unfortunately, it's not working for me. I'm trying to figure out what's happening. I suspect it's the --xpath argument to xmllint. I don't see it in the manual, nor can I guess what it should be doing.

$ xmllint --version
xmllint: using libxml version 20708
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib

Ok. Apparently I have version 20706. I'll update it!

I'm not sure that will fix it. I've seen some versions of xmllint that lack the '--xpath' argument. But there are many ways to extract this information: XSLT, /usr/bin/xpath, a simple grep "Million Years", etc.
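
As an illustration of the "simple grep" route, here is a rough Python sketch that pulls the number in front of "Million Years" out of a page with a regular expression. The function name `extract_mya` and the HTML fragment are hypothetical; the fragment only mimics the span/h1 structure that the XPath expression in the script targets, and the real page may look different.

```python
import re

# Pattern for strings such as "782.7 Million Years Ago".
AGE = re.compile(r"(\d+(?:\.\d+)?)\s+Million Years", re.IGNORECASE)


def extract_mya(html):
    """Return the first divergence time found in the text, in millions
    of years, or None if the phrase does not occur."""
    match = AGE.search(html)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical fragment, not copied from the real site.
    sample = '<span class="panel year block"><h1>94.2 Million Years Ago</h1></span>'
    print(extract_mya(sample))
```

A regular expression is more fragile than a real HTML parser, but it does not depend on xmllint having '--xpath' support.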

In the end, I decided to implement it in Python. It might be slower, but the output is exactly what I want. Your solution was my inspiration, thank you!

11.0 years ago
David W 4.9k

There is no official way to automate this process, but check out the URLs:

http://timetree.org/index.php?taxon_a=homo&taxon_b=pongo&submit=Search

It should be straightforward to pick your favourite scripting language, build URLs for each comparison and (maybe with a bit more difficulty) parse out the dates from the resulting pages.
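
The URL-building step might look like the following Python sketch. It assumes the query parameters seen in the example URL (taxon_a, taxon_b, submit=Search), which may no longer be valid; `BASE_URL` and `pairwise_urls` are illustrative names. It only generates the URLs and does not fetch anything.

```python
from itertools import combinations
from urllib.parse import urlencode

# Assumption: same endpoint and parameters as the example URL in this thread.
BASE_URL = "http://timetree.org/index.php"


def pairwise_urls(species):
    """Yield (taxon_a, taxon_b, url) once for every unordered pair of species."""
    for a, b in combinations(species, 2):
        # urlencode encodes spaces as "+", e.g. "Homo sapiens" -> "Homo+sapiens".
        query = urlencode({"taxon_a": a, "taxon_b": b, "submit": "Search"})
        yield a, b, f"{BASE_URL}?{query}"


if __name__ == "__main__":
    for a, b, url in pairwise_urls(["homo", "pongo", "Canis lupus familiaris"]):
        print(a, b, url)
```

For a 40-species list this produces 40 × 39 / 2 = 780 URLs, one per unordered pair, which is the number of lookups the full matrix needs.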

It's just a matter of deciding whether the time spent writing the scripts is worth avoiding the pain in your wrist.

Whoops, my scant answer crossed with Pierre's much more complete one. Should change mine to "do what Pierre says" :-)

11.0 years ago
omeara.brian ▴ 50

Note that TimeTree asks that you don't do this; from the bottom of their page: "Currently large scale, automated, data-mining is not permitted". I haven't tested to see if it's possible (I imagine it would be, though an easy thing to do on their end would be to block your IP eventually), but they don't want you to.

We've been building a more open alternative to TimeTree called DateLife.org. It still needs more trees (TimeTree is much better populated) but we encourage scraping, downloading the source, downloading the set of trees, etc. Let me know if you have patches or more trees for it.

Very good initiative, I'll take a look. I fail to see why TimeTree does not provide tools to mine their database; to me it's a terrible mistake that discourages researchers from using it.
