Question: Programmatic Access To The Carbohydrate-Active Enzymes Database (www.cazy.org)?
1
7.6 years ago by
Tacoma, WA
James Estevez90 wrote:

I'd like to use their database to perform a comparative analysis, but there doesn't seem to be any way to access it via flat file or some sort of REST or SOAP interface. I'm interested in the data in genome pages like this one, but I'd rather not wget and parse all that HTML (although it seems well structured, so it wouldn't be a huge time sink to do so).

Am I missing an obvious solution?

parsing • 5.2k views
modified 6.2 years ago by Alexander Viborg10 • written 7.6 years ago by James Estevez90
2
7.3 years ago by
Tacoma, WA
James Estevez90 wrote:

Chalk this one up to reinventing the wheel. There is a webservice available via the CAZymes Analysis Toolkit (Park et al.):

Obtaining data from the CAZy database

The information content of each CAZy family is directly extracted from its HTML pages and populated into the local database. The HTML pages obtained through a GET request for a family are parsed to associate the family with the GenBank accession number, related CAZy families, known activities, EC numbers, and available cross references to other databases including UniProt (The UniProt Consortium 2009), Pfam (Finn et al. 2008), and PDB (Kirchmair et al. 2008). The latest download was made on 30 Sep 2009. The local database was built using MySQL, and Perl scripts developed in house are used to create and update the database.


Park, Byung H, Tatiana V Karpinets, Mustafa H Syed, Michael R Leuze, and Edward C Uberbacher. “CAZymes Analysis Toolkit (CAT): web service for searching and analyzing carbohydrate-active enzymes in a newly sequenced organism using CAZy database.” Glycobiology 20, no. 12 (December 2010): 1574-1584.

2
7.1 years ago by
Alexander Viborg20 wrote:

I've written these scripts for the CAZy database. In contrast to the server by Park et al., they don't use a local database, so they are always up to date and a lot faster.

Use them as you like; you can find the link on my website: http://ahv.dk/

Enjoy Alexander

1
7.6 years ago by
Copenhagen, Denmark
Lars Juhl Jensen11k wrote:

It looks like all the data that you are after are put in the page server-side. I do not understand what makes you think that they use JavaScript to construct each page, and I certainly do not see anything to reverse engineer.

As far as I can see, your only option is to download all the HTML pages and parse them.
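For what it's worth, the parsing side can be sketched with nothing but the Python standard library. This is only a rough illustration: the sample markup below is made up (it is not copied from cazy.org), and the names TableRowParser, parse_table_rows and fetch are hypothetical helpers, so the real pages will almost certainly need their own tweaks.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <a>) still arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_table_rows(html):
    """Return all table rows in the HTML as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch(url):
    """Download a page as text (the encoding is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up sample markup for illustration only.
SAMPLE = """
<table>
  <tr><th>Protein Name</th><th>Family</th><th>GenBank</th></tr>
  <tr><td>alpha-amylase</td><td><a href="GH13.html">GH13</a></td><td>AAA00001.1</td></tr>
</table>
"""

rows = parse_table_rows(SAMPLE)
for row in rows:
    print("\t".join(row))
```

In practice you would call parse_table_rows(fetch(url)) for each page you care about, ideally with a pause between requests so you don't hammer their server.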


Rough script to do so is here: http://gist.github.com/1386797

written 7.3 years ago by James Estevez90
1
7.6 years ago by
AU
Pansapiens30 wrote:

Chances are, their data is stored in a relational database backend. In the absence of an API, you could try sending a polite email to the database owners/administrators explaining your situation and asking for a database dump (eg, as SQL). This could then be reconstituted locally and queried via SQL.

Your request may just be rejected or ignored, but it's always worth a shot.
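If you did get a dump, reloading and querying it could look roughly like the sketch below. Big assumptions here: the table and column names are invented (I have no idea what their schema looks like), the accessions and organisms are placeholder values, and I'm using SQLite from the Python standard library purely for illustration. A real dump would likely be in another dialect (MySQL, say) and would need to be loaded into a matching server.

```python
import sqlite3

# Hypothetical dump: schema and rows are invented for illustration.
DUMP_SQL = """
CREATE TABLE protein (accession TEXT PRIMARY KEY, family TEXT, organism TEXT);
INSERT INTO protein VALUES ('AAA00001.1', 'GH13', 'Organism A');
INSERT INTO protein VALUES ('AAA00002.1', 'GH13', 'Organism B');
INSERT INTO protein VALUES ('AAA00003.1', 'GT2', 'Organism A');
"""


def load_dump(sql_text):
    """Rebuild the database in memory from the text of an SQL dump."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(sql_text)
    return conn


def family_counts(conn):
    """Count proteins per family, sorted by family name."""
    cur = conn.execute(
        "SELECT family, COUNT(*) FROM protein GROUP BY family ORDER BY family"
    )
    return cur.fetchall()


conn = load_dump(DUMP_SQL)
counts = family_counts(conn)
for family, n in counts:
    print(family, n)
```

The appeal over scraping is that a comparative analysis becomes a handful of GROUP BY and JOIN queries instead of a pile of parsing code.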

0
6.2 years ago by
Copenhagen, Denmark
Alexander Viborg10 wrote:

Since my last post I've started to re-code many of my bioinformatics assistant scripts, both to keep up with the changes on the CAZy website and to make them even faster.

To extract protein sequences from the CAZy database please use my tools here: http://research.ahv.dk/cazy

You can specify any family, sub-family, and organism, and I'll also add the option to specify an EC number. For instance, it should take less than a minute to extract the big GH13 family of ~13,000 sequences.

These scripts are by far superior to the Park et al. server, as they do not rely on a local copy of CAZy but take the available data directly from the live site. I also have a full copy available, updated every 6 months, for those who want to set up their own database to BLAST or the like.

Happy research Alexander

modified 3.1 years ago • written 6.2 years ago by Alexander Viborg10