Gbrowse2 Mysql Indexing
2
0
Entering edit mode
12.4 years ago

Dear all,

I'm setting up a gbrowse2 server, and I noticed the queries to the apache2 server all end up in

Timeout waiting for output from CGI script /var/www/cgi-bin/gb2/gbrowse

Here I'm dealing with hundreds of thousands of sequences (40GB of eukaryotic genome annotation), therefore I guess the database needs to be indexed. I'm using MySQL. Question is: which table/field should be indexed? I couldn't find an answer on their tutorial.

Thanks a lot!

Federico

EDIT: yes, the demo examples work. Also my data, reduced to less than 10GB of GFF3 lines, is working perfectly with no timeout. Addittionally, increasing the timeout limits in the gbrowse.conf file will eventually provide a result. I tried the speed-up tricks suggested by the gbrowse tutorial (i.e. increasing the memory available for MySQL), without a big improvement in the performance. I know there is a field in a table of the gbrowse2 database that needs to be indexed to have a massive improvement in the performance. I thought that would be feature::seqid or name::name but no success.

mysql • 2.6k views
ADD COMMENT
1
Entering edit mode

is the volvox demo working properly?

ADD REPLY
1
Entering edit mode

Don't know gbrowse data model but mysql: Try using 'mytop' on the db server to see which query is being issued. Then you could optimize this query by either creating an index (if it makes sense). Try the OPTIMIZE TABLE command in mysql first to clean up existing indices.

ADD REPLY
0
Entering edit mode

Yes, all the demos work. Also my data works, when reduced to less than 10GB of GFF3 lines let's say. As I wrote in the other comments, I know it's a bottleneck in the MySQL query, and I know I fixed a couple years ago by indexing the right field in the right table generated on MySQL. BU then forgot which table that was.

ADD REPLY
4
Entering edit mode
12.4 years ago
Scott Cain ▴ 770

It is generally best to ask GBrowse questions on the GBrowse mailing list, since there are many experienced users who follow the list.

That said, Sean's suggestions make sense. When you say "hundreds of thousands of sequences" what do you mean? Are those references sequences (like a transcriptome)? Or are they mapped reads? If they are mapped reads, perhaps should be using Bio::DB::Sam or Bio::DB::BigWig.

ADD COMMENT
0
Entering edit mode

It's a complex eukaryotic genome with coverage, gene models, ESTs and anything you can think of. I cannot tell you the number of scaffolds and nt, but we are talking about 40GB of data. The Bio::DB:BigWig is actually a good idea, although I'm sure a huge improvement can be achieved by properly indexing one of the MySQL table. I know because I did it two years ago (and then went on holiday and forgot which table that was... ok -1 on me).

ADD REPLY
1
Entering edit mode
12.4 years ago

Hundreds of thousands of sequences is not large by gbrowse standards, I do not think. A standard gbrowse instance includes indexing when the database is built. You should probably confirm that the reason that the the code is timing out is the database. You could also run the SQL query outside the CGI environment.

ADD COMMENT
0
Entering edit mode

It's totally depending on the size. I can avoid the timeour issue by reducing the number of sequences. ANd yes it's the MySQL query that takes long, essentially the bottleneck is there. I tried the tricks on the tutorial (changing the amount of memory available for MySQL, it didn't massively improve the situation).

ADD REPLY

Login before adding your answer.

Traffic: 1868 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6