Question: Automatic Data Extraction From Timetree
Biojl (Barcelona) wrote, 6.0 years ago:

Does anyone know how to programmatically extract information from http://timetree.org/?

I have to build a 40x40 matrix of divergence times between species, and my wrist is starting to hurt because I have to look up every pairwise combination manually.

UPDATE: The solutions provided below have stopped working.

evolution tree • 2.4k views
modified 2.5 years ago by Biostar • written 6.0 years ago by Biojl

Any chance you have or know of a new solution to this problem? Would love to get some of the data off the site.

written 3.7 years ago by UnivStudent

No, sorry. I stopped using timetree.org because, without permission to extract data automatically, it is of little use in science. It is just a curiosity to show friends on a phone.
You can give DateLife.org a try (see the last answer). It didn't work for me, and I don't know whether it's still in development. Test it and report your results!

written 3.7 years ago by Biojl
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 6.0 years ago:

Say you have a text file containing a list of organisms:

$ cat input.txt
Homo Sapiens
Drosophila melanogaster
Canis lupus familiaris
Escherichia coli

The following bash script sends a request with curl for each pair of organisms and extracts the distance with xmllint/xpath:

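#!/bin/bash
# Split input on newlines only, so names containing spaces stay whole.
IFS="
"
# Replace spaces with "+" so each name can go straight into the query string.
tr " " "+" < input.txt | while read -r O1
do
    tr " " "+" < input.txt | while read -r O2
    do
        # Query each unordered pair once: skip self-pairs and mirrored pairs.
        if [[ "${O1}" < "${O2}" ]]
        then
            # Fetch the result page, pull the divergence time out of the
            # "panel year block" span and wrap it in an SQL insert statement.
            curl -s "http://timetree.org/index.php?taxon_a=${O1}&taxon_b=${O2}&submit=Search" |
            xmllint --html --format --xpath 'concat("insert into SPECIES(org1,org2,dist) values (__QUOTE____A____QUOTE__,__QUOTE____B____QUOTE__,__QUOTE__",normalize-space(//span[@class="panel year block"][h1]),"__QUOTE__);#")' - 2> /dev/null |
            tr "#" "\n" |
            sed -e "s/__A__/${O1}/g" -e "s/__B__/${O2}/g" -e "s/__QUOTE__/'/g" |
            tr "+" " "
        fi
    done
done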

Result:

~$ bash organisms.sh 
insert into SPECIES(org1,org2,dist) values ('Drosophila melanogaster','Homo Sapiens','782.7 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Drosophila melanogaster','Escherichia coli','2535.8 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Homo Sapiens','94.2 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Drosophila melanogaster','782.7 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Escherichia coli','2535.8 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Escherichia coli','Homo Sapiens','2535.8 Million Years Ago');
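
Rows like these still have to be arranged into the square matrix the question asks for. Here is a minimal sketch of that step, assuming the pairwise results have instead been saved as tab-separated lines (org1, org2, value). The function name pairs_to_matrix, the "NA" placeholder for missing pairs and the 0 on the diagonal are illustrative choices, not something taken from the thread:

```bash
#!/bin/bash
# Pivot "org1<TAB>org2<TAB>value" rows (one per unordered pair, read from
# stdin) into a symmetric tab-separated matrix with a header row.
pairs_to_matrix() {
    awk -F'\t' '
        # Record each name once, in order of first appearance.
        function seen(name) { if (!(name in idx)) { idx[name] = ++n; names[n] = name } }
        # Store every value in both directions so the matrix is symmetric.
        { seen($1); seen($2); d[$1, $2] = $3; d[$2, $1] = $3 }
        END {
            line = ""
            for (j = 1; j <= n; j++) line = line "\t" names[j]
            print line
            for (i = 1; i <= n; i++) {
                line = names[i]
                for (j = 1; j <= n; j++) {
                    if (i == j) v = "0"                 # assumption: self-distance is 0
                    else if ((names[i], names[j]) in d) v = d[names[i], names[j]]
                    else v = "NA"                       # assumption: marker for a missing pair
                    line = line "\t" v
                }
                print line
            }
        }
    '
}
```

It could be used as `pairs_to_matrix < pairs.tsv > matrix.tsv`, where pairs.tsv is a hypothetical file holding the collected values.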
written 6.0 years ago by Pierre Lindenbaum

That's awesome! Unfortunately, it's not working for me. I'm trying to figure out what's happening. I suspect it's the --xpath argument to xmllint: I don't see it in the manual, nor can I guess what it should be doing.

written 5.9 years ago by Biojl
$ xmllint --version
xmllint: using libxml version 20708
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib
written 5.9 years ago by Pierre Lindenbaum

Ok. Apparently I have version 20706. I'll update it!

written 5.9 years ago by Biojl

I'm not sure that will fix it. I have seen some versions of xmllint that lack the '--xpath' argument. But there are many other ways to extract this information: xslt, /usr/bin/xpath, a simple grep "Million Years", etc.
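
For the grep route, a minimal sketch could look like the following. It assumes the result page contains the date as plain text in the form "94.2 Million Years Ago", as in the output shown earlier in the thread. The function name extract_divergence is made up for illustration, and the page format may well have changed since:

```bash
#!/bin/bash
# Read an HTML page on stdin and print the first "<number> Million Years Ago"
# string found in it. Prints nothing if the page has no such text.
# Assumption: the date appears as plain text in this exact wording.
extract_divergence() {
    grep -oE '[0-9]+(\.[0-9]+)? Million Years Ago' | head -n 1
}

# Possible use with one of the search URLs from this thread:
#   curl -s "http://timetree.org/index.php?taxon_a=homo&taxon_b=pongo&submit=Search" | extract_divergence
```

Unlike the XPath version, this does not depend on the page's markup, only on the wording of the date, so it needs no xmllint at all.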

written 5.9 years ago by Pierre Lindenbaum

In the end I decided to implement it in Python. It might be slower, but the output is exactly what I want. Your solution was my inspiration, thank you!

written 5.9 years ago by Biojl
David W (New Zealand) wrote, 6.0 years ago:

There is no official way to automate this process, but check out the URLs:

http://timetree.org/index.php?taxon_a=homo&taxon_b=pongo&submit=Search

It should be straightforward to pick your favourite scripting language, build URLs for each comparison and (maybe with a bit more difficulty) parse the dates out of the resulting pages.

It's just a matter of deciding whether the time spent writing the scripts is worth avoiding the pain in your wrist.
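
The URL-building half might be sketched like this, assuming the query pattern in the example URL above and a file with one name per line. The function names timetree_url and all_pair_urls are hypothetical. For 40 species this yields 40*39/2 = 780 URLs:

```bash
#!/bin/bash
# Print the search URL for one pair of taxa, following the pattern of the
# example URL (spaces in names become "+").
timetree_url() {
    qa=$(printf '%s' "$1" | tr ' ' '+')
    qb=$(printf '%s' "$2" | tr ' ' '+')
    printf 'http://timetree.org/index.php?taxon_a=%s&taxon_b=%s&submit=Search\n' "$qa" "$qb"
}

# Print one URL per unordered pair of names from the file given as $1
# (one name per line, blank lines ignored).
all_pair_urls() {
    # awk lists each pair once as "first<TAB>second"; the loop turns it into a URL.
    awk 'NF { names[++n] = $0 }
         END { for (i = 1; i <= n; i++)
                   for (j = i + 1; j <= n; j++)
                       print names[i] "\t" names[j] }' "$1" |
    while IFS="$(printf '\t')" read -r first second; do
        timetree_url "$first" "$second"
    done
}
```

Running `all_pair_urls input.txt` only prints the URLs. Fetching them is a separate decision, and a pause between requests would be the polite default.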

written 6.0 years ago by David W

Whoops, my scant answer crossed with Pierre's much more complete one. Should change mine to "do what Pierre says" :-)

written 6.0 years ago by David W
omeara.brian wrote, 5.9 years ago:

Note that TimeTree asks that you don't do this; from the bottom of their page: "Currently large scale, automated, data-mining is not permitted". I haven't tested to see if it's possible (I imagine it would be, though an easy thing to do on their end would be to block your IP eventually), but they don't want you to.

We've been building a more open alternative to TimeTree called DateLife.org. It still needs more trees (TimeTree is much better populated) but we encourage scraping, downloading the source, downloading the set of trees, etc. Let me know if you have patches or more trees for it.

written 5.9 years ago by omeara.brian

A very good initiative; I'll take a look. I fail to see why TimeTree does not provide tools to mine their database. To me it's a terrible mistake that discourages researchers from using it.

written 5.9 years ago by Biojl