Question

Programmatic access to the Carbohydrate-Active Enzymes database (www.cazy.org)?


12.7 years ago

James Estevez ▴ 90

I'd like to use their database to perform a comparative analysis, but there doesn't seem to be any way to access it via a flat file or some sort of REST or SOAP interface. I'm interested in the data in genome pages like this one, but I'd rather not wget and parse all that HTML (although it seems well structured, so it wouldn't be a huge time sink to do so).

Am I missing an obvious solution?


score 2 · Answer 1 · 2011-11-22

Chalk this one up to reinventing the wheel. There is a webservice available via the CAZymes Analysis Toolkit (Park et al.):

Obtaining data from the CAZy database

The information content of each CAZy family is directly extracted from its HTML pages and populated into the local database. The HTML pages obtained through a GET request for a family are parsed to associate the family with the GenBank accession number, related CAZy families, known activities, EC numbers, and available cross references to other databases including UniProt (The UniProt Consortium 2009), Pfam (Finn et al. 2008), and PDB (Kirchmair et al. 2008). The latest download was made on 30 Sep 2009. The local database was built using MySQL, and Perl scripts developed in house are used to create and update the database.

Park, Byung H, Tatiana V Karpinets, Mustafa H Syed, Michael R Leuze, and Edward C Uberbacher. “CAZymes Analysis Toolkit (CAT): web service for searching and analyzing carbohydrate-active enzymes in a newly sequenced organism using CAZy database.” Glycobiology 20, no. 12 (December 2010): 1574-1584.
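The association step described in the quoted passage (linking a family to its GenBank accessions and EC numbers) might look roughly like the sketch below. This is not the toolkit's own code, which the paper says is written in Perl; `build_family_record`, the record layout, and the placeholder accessions are all hypothetical, and the input is assumed to be text already extracted from a family page.

```python
import re

# EC numbers have four dot-separated fields; later fields may be "-" when unassigned.
EC_PATTERN = re.compile(r"\b\d+\.(?:\d+|-)\.(?:\d+|-)\.(?:\d+|-)(?!\w)")


def build_family_record(family, activities_text, accessions):
    """Bundle one family's EC numbers and accessions into a dict (hypothetical layout)."""
    # De-duplicate, then sort as plain strings (lexicographic, not numeric, order).
    ec_numbers = sorted(set(EC_PATTERN.findall(activities_text)))
    return {"family": family, "ec_numbers": ec_numbers, "genbank": list(accessions)}


# Placeholder accessions, not real GenBank identifiers.
record = build_family_record(
    "GH13",
    "alpha-amylase (EC 3.2.1.1); pullulanase (EC 3.2.1.41)",
    ["XXX00001.1", "XXX00003.1"],
)
print(record)
```

Records like this could then be written to whatever local store is convenient, which is the role MySQL plays in the quoted description.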

score 2 · Answer 2 · 2012-02-21


12.2 years ago

Alexander Viborg ▴ 20

I've written these scripts for the CAZy database. In contrast to the server by Park et al., they don't use a local database, so they are always up to date and a lot faster.

Use them as you like; you can find the link through my website: http://ahv.dk/

Enjoy, Alexander


score 1 · Answer 3 · 2011-08-21


12.7 years ago

Lars Juhl Jensen 11k

It looks like all the data that you are after are put in the page server-side. I do not understand what makes you think that they use JavaScript to construct each page, and I certainly do not see anything to reverse engineer.

As far as I can see, your only option is to download all the HTML pages and parse them.
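A minimal sketch of the parsing half of that approach, using only the Python standard library. The table layout in `SAMPLE_HTML` is an assumption made for illustration and is not the actual CAZy markup; the accessions are placeholders. In practice the HTML string would come from the downloaded pages.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text nested inside links or other inline tags is still captured here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    """Return all table rows in the HTML as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page fragment; the real pages will differ.
SAMPLE_HTML = """
<table>
  <tr><th>Protein Name</th><th>Family</th><th>GenBank</th></tr>
  <tr><td>example protein A</td><td><a href="GH13.html">GH13</a></td><td>XXX00001.1</td></tr>
  <tr><td>example protein B</td><td>GT2</td><td>XXX00002.1</td></tr>
</table>
"""

rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print("\t".join(row))
```

Printing tab-separated rows gives the kind of flat file the question asks for; a dedicated HTML library would likely cope better with malformed markup.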



Rough script to do so is here: http://gist.github.com/1386797

ADD REPLY • link 12.4 years ago by James Estevez ▴ 90

score 1 · Answer 4 · 2011-09-06

Chances are, their data is stored in a relational database backend. In the absence of an API, you could try sending a polite email to the database owners/administrators explaining your situation and asking for a database dump (e.g., as SQL). This could then be reconstituted locally and queried via SQL.

Your request may just be rejected or ignored, but it's always worth a shot.
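If a dump did arrive, reconstituting and querying it could be as simple as the sketch below. The schema, table name, and rows in `DUMP_SQL` are invented for illustration, since the real structure of the CAZy backend is unknown. SQLite is used here only because it ships with Python; a dump from another engine such as MySQL may need dialect adjustments before it loads.

```python
import sqlite3

# Stand-in for the contents of a dump file; schema and rows are hypothetical.
DUMP_SQL = """
CREATE TABLE protein (accession TEXT PRIMARY KEY, family TEXT, organism TEXT);
INSERT INTO protein VALUES ('XXX00001.1', 'GH13', 'Organism one');
INSERT INTO protein VALUES ('XXX00002.1', 'GT2', 'Organism one');
INSERT INTO protein VALUES ('XXX00003.1', 'GH13', 'Organism two');
"""


def family_counts(dump_sql):
    """Load a SQL dump into an in-memory database and count proteins per family."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(dump_sql)  # replay the dump's statements
        cur = conn.execute(
            "SELECT family, COUNT(*) FROM protein GROUP BY family ORDER BY family"
        )
        return cur.fetchall()
    finally:
        conn.close()


counts = family_counts(DUMP_SQL)
for family, n in counts:
    print(family, n)
```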

Ram · Answer 5 · 2013-01-03

Since my last post I've started to re-code many of my bioinformatics assistant scripts, both to follow the changes on the CAZy website and to make them even faster.

To extract protein sequences from the CAZy database please use my tools here: http://research.ahv.dk/cazy

You can specify any family, sub-family, and organism, and I'll also add the option to specify an EC number. It should take less than a minute, for instance, to extract the big GH13 family of ~13,000 sequences.

These scripts are far superior to the Park et al. server, as they do not rely on a local copy of CAZy but take the available data directly from the CAZy website. I also have a full copy available, updated every 6 months, for those who want to set up their own database to BLAST against or the like.

Happy research, Alexander