This is an update to the answer here concerning the architecture of the UniProt website. This architecture makes sense for us, but I would not recommend it to anyone in general without thinking hard about it first.
UniProt website architecture in 2017
UniProt is a consortium database maintained by three partners: SIB, EBI and PIR. EBI and PIR run the servers that host the website.
The UniProt data on the public website is mostly read-only because of the 4-weekly release cycle. Only job data (BLAST etc.) is read-write.
All partners use Linux (CentOS or Red Hat 7) as their server OS. EBI runs a proprietary load balancing solution, while PIR uses Apache 2+ with mod_proxy. Behind this front end we run a total of 10 Tomcat servers: 8 at EBI (4 per datacentre; per EBI policy, 4 are in hot standby mode and do not serve traffic on normal days) and 2 at PIR. All of them run the latest security release of Java. We use DNS round robin to cope with the disappearance of a datacentre, and load balancing behind it (you need both).
PIR uses physical machines with directly attached local hard disks, while EBI uses virtual machines (VMware) with either Tintri-attached disks for the data part or their Isilon system for temp files etc.
Search is provided by Lucene 6.6, soon to be the 7 series. Our setup enables a few features that stock Lucene/Solr does not provide, such as cross-dataset queries.
Most data is no longer stored in BDB/je but in a custom datastore, which is a classic key/offset map stored either in Lucene or in memory, depending on the data subset. The offset points into a large memory-mapped file, at an LZ4- or Zstd-compressed binary representation of a data record: LZ4 for datasets like taxonomy, Zstd for UniParc, UniRef etc. Sequences are just data and are not stored independently from the other data in our system. The key to this kind of architecture is to make sure we do no more than 5 disk seeks per UniProtKB entry page (1 for the entry, up to 3 for similar proteins and 1 for taxonomy).
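To make the key/offset idea concrete, here is a minimal sketch in Python. It is not the production code: the names `write_store` and `RecordStore` and the 4-byte length prefix are assumptions made for illustration, and `zlib` from the standard library stands in for LZ4/Zstd so that the example runs without extra packages.

```python
import mmap
import os
import struct
import tempfile
import zlib


def write_store(path, records):
    """Append each compressed record to one file and return a key -> offset map."""
    offsets = {}
    with open(path, "wb") as out:
        for key, payload in records.items():
            blob = zlib.compress(payload)  # stand-in for LZ4/Zstd (assumption)
            offsets[key] = out.tell()
            # Hypothetical layout: 4-byte big-endian length, then the compressed bytes.
            out.write(struct.pack(">I", len(blob)))
            out.write(blob)
    return offsets


class RecordStore:
    """Read-only lookups through a memory map of the data file."""

    def __init__(self, path, offsets):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._offsets = offsets

    def get(self, key):
        offset = self._offsets.get(key)
        if offset is None:
            return None
        # One jump to the offset: read the length prefix, then the record itself.
        (length,) = struct.unpack_from(">I", self._map, offset)
        start = offset + 4
        return zlib.decompress(self._map[start:start + length])

    def close(self):
        self._map.close()
        self._file.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "records.bin")
    offsets = write_store(path, {"P12345": b"entry one", "9606": b"Homo sapiens"})
    store = RecordStore(path, offsets)
    print(store.get("9606"))
    store.close()
```

Each lookup is a map access followed by a single positioned read, which is what keeps the number of seeks per page small and predictable. Because the file is only ever read after a release is built, many threads can share the same mapping.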
All state is injected from an XML file into the different Struts actions using the Spring dependency injection framework, which in practice means injection via the web context. This is fine because both our custom datastore and our search engine, Lucene, handle concurrent access well.
High availability is achieved by having individual copies of the data. Job data, such as BLAST results, is shared on demand via HTTP requests between mirrors. This is acceptable because users rarely access job data older than one hour.
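The on-demand sharing of job data can be pictured roughly as follows. This is a sketch under stated assumptions, not the real implementation: `JobStore` is a hypothetical name, each peer is modelled as a plain callable standing in for an HTTP GET against another mirror, and keeping a local copy after a successful fetch is an assumption of the example.

```python
class JobStore:
    """Job results kept locally, with fallback lookups against other mirrors."""

    def __init__(self, peers=()):
        self._local = {}
        # Each peer is a callable job_id -> bytes or None (stand-in for HTTP).
        self._peers = list(peers)

    def put(self, job_id, result):
        self._local[job_id] = result

    def get_local(self, job_id):
        return self._local.get(job_id)

    def get(self, job_id):
        result = self._local.get(job_id)
        if result is not None:
            return result
        for fetch in self._peers:
            try:
                result = fetch(job_id)
            except OSError:
                continue  # mirror unreachable, try the next one
            if result is not None:
                self._local[job_id] = result  # assumption: keep a local copy
                return result
        return None


if __name__ == "__main__":
    ebi = JobStore()
    ebi.put("job-1", b"blast hits")
    pir = JobStore(peers=[ebi.get_local])
    print(pir.get("job-1"))
```

The read-only release data never goes through this path, so only the small, short-lived job data pays the cost of a cross-mirror request.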
Storage needs for release 2017_09 are almost 307 GB for the data records and 438 GB for the Lucene indexes (version 6.6).
I am still very happy with the architecture today. It has served us with minimal issues for more than 10 years now and still performs very well. We are very happy with how Lucene has kept up with the data explosion over time. We also have a measured uptime of more than 99.9%, which is not trivial with so many datacentres and users.
As for the rest, our Struts/JSP code could use an update, and we are actively planning one, but the implementation has not been decided yet.
The website as it stands is only suitable for finding entries. If you want to run analytical queries, I strongly recommend using our SPARQL endpoint. This is an interface that exposes all the UniProt data via a standard query language and is very suitable for deep queries over our data model.
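As an illustration of what an analytical query looks like, the sketch below counts reviewed human proteins. The endpoint address and the vocabulary terms (`up:Protein`, `up:reviewed`, `up:organism`, the `taxon:` prefix) are not given above and are filled in here as assumptions based on the public UniProt SPARQL service, so check them against the current documentation before relying on them. The helper names `build_request` and `run_query` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint address; verify before use.
ENDPOINT = "https://sparql.uniprot.org/sparql"

QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
SELECT (COUNT(?protein) AS ?proteins)
WHERE {
  ?protein a up:Protein ;
           up:reviewed true ;
           up:organism taxon:9606 .
}
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request following the SPARQL protocol's `query` parameter."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the parsed JSON result (needs network access)."""
    with urllib.request.urlopen(build_request(query, endpoint)) as response:
        return json.load(response)


if __name__ == "__main__":
    for row in run_query(QUERY)["results"]["bindings"]:
        print(row["proteins"]["value"])
```

An aggregate like this has no natural equivalent in the website search, which is the kind of question the SPARQL interface is meant for.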