As I said in the beginning, I'll try to give feedback on my progress and experiences as I go along. I group this by components.
I started with a simple test installation of GBrowse2 on my MacBook. The test installation took about 3 hours, most of it spent installing dependencies using CPAN and MacPorts.
I followed the instructions in the HowTo. An important point to note: the instructions are quite good, and you should follow them as closely as possible! At times I messed up, but then I noticed that I hadn't read carefully and had tried to jump ahead.
GBrowse is written in Perl and requires BioPerl. Other requirements include Apache2 (installed via MacPorts) and MySQL (don't install it via MacPorts, since compiling takes too long; download it instead). I needed to add 2 lines to my apache.conf, copy some files and restart the web server.
After that, I had a local install up and spent another day playing with the configuration and working through the configuration tutorial, which I also recommend. GBrowse is highly flexible and configurable with respect to appearance and the tracks displayed. Everything is done using configuration files; no programming has been necessary so far.
The easy install is based solely on files and 'in-memory' data. While that works fine for small (test) data sets, it doesn't scale to a real genome of >100 Mbp for me. The next step is to make an install on a dedicated server with a database backend.
As the configuration files for GBrowse and your project data are the artifacts that contain most modifications and adaptations, it might be advisable to put them under revision control. The easiest way is to use RCS in-place, but once the install gets larger and many people work on it, putting the whole configuration directory tree into revision control (git, SVN) might be preferable.
Chado is a database model for genomic data. It is meant to be used with PostgreSQL (Postgres 8.4 is recommended, though I am testing with Postgres 9.1). I had experience with MySQL before, but it took me a little time to figure out what is different. Chado is bundled with an installer like a normal Perl module:
> perl Makefile.PL
> make install
> make prepdb... etc.
Installation took about 3 days, but only because of a single missing dependency. I sent a support mail and got an answer within a few hours. The problem should be fixed in the documentation by now, and installation should proceed without problems.
Loading Data in CHADO
Possibly the most complicated part. There is a bulk loader script, but after one week I haven't succeeded in loading a single GFF3 file into the database. This is mainly because of format problems in the GFF3 files. First I tried with Daphnia pulex GFF3 files (from JGI), then with D. melanogaster (from NCBI genomes). Both files will need repairs to be loadable into Chado. That might be mainly due to the weak format definition of GFF3.
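Before feeding a file to the loader, it can help to get a quick overview of what is actually in it. The following is only a rough Python sketch of such a check, not part of Chado or GBrowse: the function name `summarize_gff3` and the sample lines are made up for illustration. It relies only on the basic GFF3 layout (nine tab-separated columns, `#` for comments and directives, `##FASTA` ending the feature section) and reports malformed lines plus the distinct values of the type and sequence-id columns, which you can then compare by hand against the Sequence Ontology and your genome file.

```python
import io
from collections import Counter


def summarize_gff3(handle):
    """Collect basic statistics from GFF3 text to spot format problems."""
    malformed = []      # (line number, line) pairs that lack 9 columns
    types = Counter()   # column 3: feature type terms
    seqids = Counter()  # column 1: sequence identifiers
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break       # embedded sequence follows, no more features
        if not line or line.startswith("#"):
            continue    # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) != 9:
            malformed.append((lineno, line))
            continue
        seqids[fields[0]] += 1
        types[fields[2]] += 1
    return malformed, types, seqids


if __name__ == "__main__":
    # Made-up sample; the last line uses spaces instead of tabs.
    sample = io.StringIO(
        "##gff-version 3\n"
        "scaffold_1\tJGI\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
        "scaffold_1\tJGI\texon\t100\t400\t.\t+\t.\tParent=gene1\n"
        "scaffold_2 JGI gene 5 50 . - . ID=gene2\n"
    )
    malformed, types, seqids = summarize_gff3(sample)
    print("malformed lines:", [n for n, _ in malformed])
    print("types:", dict(types))
    print("seqids:", dict(seqids))
```

For a real file you would pass `open("annotation.gff3")` (a hypothetical file name) instead of the in-memory sample.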
Done, I finally managed to import the GFF3 annotation file of Daphnia pulex (FrozenGeneCatalogue). It needed edits though, found via trial and error:
- Needed to edit sequence type terms to comply with the Sequence Ontology (e.g.
- All source sequences (chromosomes, contigs, scaffolds) need to be contained in the GFF file. I wrote a Perl script that generates GFF3 source entries from the genome FASTA file and concatenates them with the original file.
Lessons learned: most annotation files will likely need sanitizing, so some scripting skills (Perl, Python, awk) and a good understanding of the formats are required to work with Chado.
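The idea behind the second fix can be sketched roughly as follows. This is a Python illustration written for this post, not the Perl script mentioned above. The function names, the `source` value and the default type `contig` are assumptions; pick whatever Sequence Ontology term fits your assembly. It reads a FASTA file, measures each sequence, and emits one GFF3 line per sequence spanning position 1 to its length.

```python
import io


def fasta_lengths(handle):
    """Yield (sequence id, length) for each record in FASTA text."""
    seqid, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seqid is not None:
                yield seqid, length
            # The id is the first whitespace-delimited word of the header.
            words = line[1:].split()
            seqid, length = (words[0] if words else ""), 0
        elif seqid is not None:
            length += len(line)
    if seqid is not None:
        yield seqid, length


def source_entries(handle, source="fasta", so_type="contig"):
    """Build one GFF3 line per sequence, covering 1..length."""
    entries = []
    for seqid, length in fasta_lengths(handle):
        columns = [seqid, source, so_type, "1", str(length),
                   ".", ".", ".", f"ID={seqid};Name={seqid}"]
        entries.append("\t".join(columns))
    return entries


if __name__ == "__main__":
    # Made-up two-record genome for demonstration.
    genome = io.StringIO(">scaffold_1 test\nACGT\nAC\n>scaffold_2\nGG\n")
    for entry in source_entries(genome):
        print(entry)
```

When combining the result with the original annotation, keep the `##gff-version 3` line at the very top and place the generated entries after it.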