I am reading The Human Genome Browser at UCSC by Kent et al.
In Figure 7, they describe a binning scheme for finding elements that overlap, or are contained within, a given range. From the paper:
Features are put in the smallest bin in which they fit. A feature covering the range indicated by line A would go in bin 1. Similarly, line B goes in bin 4 and line C in bin 20. When the browser needs to access features in a region, it must look in bins of all different sizes. To access all the features that overlapped or were enclosed by line A, the browser looks in bins 1, 2, 3, 7, 8, 9, 10, and 11. For B the browser looks in bins 1, 4, 14, 15, 16, 17. For C, the browser looks in bins 1, 5, and 20.
The bins are sized in the following manner: 128 kb, 1 Mb, 8 Mb, 64 Mb, and 512 Mb.
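To check my understanding, here is a small Python sketch of how I imagine the scheme works, using the five bin sizes above (128 kb up to 512 Mb, each level 8x the previous). Note the function names, and the bin numbering with per-level offsets (4096 bins of 128 kb, 512 of 1 Mb, etc.), are my own assumptions, not necessarily how the browser's code actually numbers Figure 7's bins:

```python
# Finest-to-coarsest offsets: 4096 bins of 128 kb, 512 of 1 Mb,
# 64 of 8 Mb, 8 of 64 Mb, 1 of 512 Mb (my assumed numbering).
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # 128 kb = 2^17 bases
BIN_NEXT_SHIFT = 3    # each level is 8x (2^3) larger

def bin_from_range(start, end):
    """Smallest bin that fully contains the half-open range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("range extends beyond the largest (512 Mb) bin")

def overlapping_bins(start, end):
    """All bins that could hold a feature overlapping [start, end)."""
    bins = []
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    return bins
```

For example, a feature lying entirely within the first 128 kb lands in a single finest-level bin, while a query over that same range has to collect one bin from each of the five levels, which matches the "look in bins of all different sizes" behavior the paper describes.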
My guess is that a reference to each element/feature (not the element itself) is stored in the smallest bin that fully contains it. If that's correct, is the list of references compressed in any way?
Further, how is binning done for elements or search queries larger than the largest bin? For example, suppose my input elements are the extents of chromosomes longer than 512 Mb, which is not impossible for some organisms. How are searches performed for elements of that size? Is there an ordered list of adjacent bin structures spanning the minimum and maximum range of the entire input set?
Are elements/features (or rather, element/feature references) retrieved in sorted order? That is, are bins queried in a particular order?