Describe Your Architecture: Uniprot (2017)
4.1 years ago
navela78 ▴ 60

This question is related to

C: Describe Your Architecture: Uniprot

Asking it again because Jerven could not add to the original post ....

What does the architecture of Uniprot look like today (Oct 2017)? How has it evolved?

If you were to implement software to store several hundreds of millions of sequences (of genomes, proteins, genes) and related metadata (e.g. name, species, taxonomy lineage) today, what technologies would you use? I am specifically interested in the back end:

1) Where would you store the metadata?
2) Where would you store the sequences for fast retrieval? Flat files, databases, or something else (e.g. a blastdb, retrieving with blastdbcmd)?

4.1 years ago
me ▴ 740

This is an update to the answer here considering the architecture of the UniProt website. This architecture makes sense for us, but it is not something I would generally recommend to anyone without really thinking hard about it first.

UniProt website architecture in 2017

UniProt is a consortium database maintained by 3 partners: SIB, EBI, and PIR. EBI and PIR run the servers behind the website. On the public website the UniProt data is mostly read-only due to the 4-weekly release cycle; only job data like BLAST etc. is read-write.

All partners use Linux (CentOS or RedHat 7) as their server OS. EBI runs a proprietary load-balancing solution, while PIR uses Apache 2+ with mod_proxy. Behind this front end we run a total of 10 Tomcat servers: 8 at EBI (4 per datacentre; per EBI policy, 4 are in hot-standby mode and do not serve traffic on normal days) and 2 at PIR, all on the latest security release of Java. We use DNS round-robin for surviving datacentre disappearance, with load balancing behind it (you need to have both).

PIR uses physical machines with direct-attached local hard disks, while EBI uses virtual machines (VMware) with either Tintri-attached disks for the data or their Isilon system for temp files etc.

Search is provided by Lucene 6.6, soon the 7 series. Using Lucene directly enables a few features that stock Lucene/Solr do not provide, such as cross-dataset queries.

Most data is no longer stored in BDB/JE but in a custom datastore: a classic key/offset map, stored either in Lucene or in memory depending on the data subset. The offset points into a large memory-mapped file, to an LZ4- or Zstd-compressed binary representation of a data record: LZ4 for datasets like taxonomy, Zstd for UniParc, UniRef etc. Sequences are just data and are not stored independently from the other data in our system. The key to this kind of architecture is making sure we do no more than 5 disk seeks per UniProtKB entry page (1 for the entry, up to 3 for similar proteins, and 1 for taxonomy).
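To make the key/offset idea concrete, here is a minimal sketch in Python (UniProt's implementation is Java). zlib stands in for LZ4/Zstd, which need third-party bindings, and all names here are hypothetical, not UniProt code:

```python
# Sketch of a key/offset datastore: compressed records are appended to
# one big data file; a key -> (offset, length) map locates each record.
# UniProt keeps that map either in Lucene or in memory per data subset.
import json
import mmap
import os
import tempfile
import zlib

class RecordStore:
    def __init__(self, path):
        self.path = path
        self.index = {}                 # key -> (offset, length)
        open(path, "wb").close()        # start with an empty data file

    def put(self, key, record):
        blob = zlib.compress(json.dumps(record).encode())
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(blob)
        self.index[key] = (offset, len(blob))

    def get(self, key):
        offset, length = self.index[key]            # in-memory map lookup
        with open(self.path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                blob = mm[offset:offset + length]   # one read of the mapped file
        return json.loads(zlib.decompress(blob))

# The record holds sequence and metadata together, as described above:
store = RecordStore(os.path.join(tempfile.mkdtemp(), "entries.bin"))
store.put("P12345", {"name": "Demo protein", "taxon": 9606, "seq": "MKVL"})
record = store.get("P12345")
```

Because the file is memory-mapped, a `get` costs one map lookup plus at most one disk seek once the page cache is cold, which is what keeps the per-entry seek budget so low.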

All state is injected from an XML file, using the Spring dependency-injection framework, into different Struts actions; in practice this is injection via the web context. This is fine because both our custom datastore and our search engine, Lucene, handle concurrent access well.
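As a rough sketch of what that wiring looks like (the bean and class names here are hypothetical, not UniProt's actual configuration):

```xml
<!-- Hypothetical Spring XML wiring: the datastore is defined once and
     injected into a Struts action as shared, read-mostly state. -->
<beans xmlns="http://www.springframework.org/schema/beans">
  <bean id="entryStore" class="org.example.datastore.EntryStore">
    <property name="dataFile" value="/data/uniprotkb/entries.bin"/>
  </bean>
  <bean id="entryAction" class="org.example.web.EntryAction">
    <property name="entryStore" ref="entryStore"/>
  </bean>
</beans>
```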

High availability is achieved by giving each mirror its own copy of the data. Job data such as BLAST results is shared on demand between mirrors via HTTP requests, which is fine because users rarely access data older than one hour.

Storage needs for release 2017_09 are almost 307 GB for the data records and 438 GB for the Lucene indexes (version 6.6).

I am still very happy with the architecture today. It has served us with minimal issues for more than 10 years now and is still very performant. We are very happy with how Lucene has kept up with the data explosion over time. We also have a measured uptime of more than 99.9%, which is not trivial with so many datacentres and users.

For the rest, our Struts/JSP code could use an update as well, and we have an active plan for that, but the implementation is not quite decided yet.

The website as-is is only suitable for finding entries; if you want to do analytical queries, I strongly recommend using our SPARQL endpoint. This interface exposes all the UniProt data via a standard query language that is very suitable for deep queries over our data model.
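As a taste of what such a query looks like, here is a sketch (in Python, so the request building is concrete) against the public endpoint at sparql.uniprot.org. The query is illustrative, and the request is only constructed here, not sent:

```python
# Build (but do not send) a GET request for UniProt's public SPARQL
# endpoint. The up: and taxon: prefixes are UniProt's RDF vocabulary;
# the query lists a few human protein entries.
from urllib.parse import urlencode

ENDPOINT = "https://sparql.uniprot.org/sparql"

QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
SELECT ?protein
WHERE {
  ?protein a up:Protein ;
           up:organism taxon:9606 .
}
LIMIT 10
"""

def build_request_url(query: str) -> str:
    """Return the GET URL for a SPARQL query (percent-encoded)."""
    return ENDPOINT + "?" + urlencode({"query": query})

url = build_request_url(QUERY)
```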

4.1 years ago
me ▴ 740

Recommendation for a new site project

Just put it in RDF and pick any SPARQL store. If you don't know what your usage is going to look like, it's much better to be very flexible in the way you can retrieve data, eating some cost in CPU/hardware inefficiency to get more of that flexibility. Also, SPARQL and RDF, being standards, allow you to change your datastore without needing to rewrite everything that touches the data.

If your RDF serialization of choice is JSON-LD, you save even more effort, as you can use it directly in your frontend website code without building a further API.
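For instance, here is a sketch of what such a record might look like (the IRIs follow UniProt's RDF conventions, but the record itself is hypothetical). Since JSON-LD is plain JSON, frontend code can read it directly:

```python
import json

# A hypothetical protein entry serialized as JSON-LD. @context maps the
# short "up:" prefix to the UniProt core vocabulary IRI.
doc = json.loads("""
{
  "@context": {"up": "http://purl.uniprot.org/core/"},
  "@id": "http://purl.uniprot.org/uniprot/P12345",
  "@type": "up:Protein",
  "up:mnemonic": "DEMO_HUMAN"
}
""")

# Frontend code can treat it as an ordinary JSON object:
protein_iri = doc["@id"]
protein_type = doc["@type"]
```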

In general, if I needed to do this quickly and in a way I could support over the long term, e.g. in a clinical setting, the datastore I would pick is probably MarkLogic.

If you are working for an academic institution that does not have any experience at all with data at scale, I doubly recommend RDF, as it is really going to help make your data FAIR; again, it's better to eat the cost in hard disks unless you're the SRA. GenBank/ENA sizes are well within the capabilities of SPARQL/RDF datastores, if you invest ENA- and GenBank-like engineering effort. Their current architectures did not appear in one day either, but had decades of tuning and adaptation.


Thank you so much! I really appreciate it. I have already spent a couple of hours reading your answers. I work for a start-up, and we do not have experience with data at this scale. I have been tasked with looking into this big-data management problem, and maybe we will hire someone with more experience soon. I am thinking of using Lucene for the search because I have some experience with it, but I am really stuck on the sequence data part. I have a couple of questions:

1) I didn't quite understand: "Most data is no longer stored in BDB/je but in a custom datastore which is a classic key/offset map stored either using lucene or in memory depending on the data subset, The offset points into a large memory mapped file to a LZ4 or Zstd compressed binary representation of a data record."

Could you dumb this down for me? The only thing I understand is that your data record contains both the sequence and the metadata (e.g. name, length, species). The record is stored in a file and compressed with LZ4. What is stored in Lucene? A key (e.g. the file path) that quickly finds the LZ4 file? Sorry, I am really lost with offsets, memory mapping, etc. Could you explain, or point me to some learning resources?

2) By SPARQL store do you mean something like this?

Really, the only use cases I have are: 1) when a user selects one or a few sequence identifiers (after a Lucene search), I need to quickly retrieve the sequences for further analysis (e.g. send them to our BLAST page in the front end); 2) a user can select several thousands or tens of thousands of sequence identifiers and download the sequences.

MarkLogic is out of the question - I doubt we have the $$ to fund it. Will something like Jena scale? We will have hundreds of millions of sequences. Do I even need such datastores for my use case?

Again, I really appreciate all your help


Don't follow the UniProt architecture; it is too niche and not appropriate for you at this point in time. It's a Formula 1 car, and I am not sure whether you are at the stage where you need a moped or can still bike ;)

Jena TDB or Virtuoso will scale to a few hundred million sequences given a reasonable hardware budget. At about a trillion you will need to start thinking hard about what you are doing. But if you use SPARQL, you will only need to fix your backend, not rewrite your frontend, when that day comes.

How long will you only search on sequence IDs? When will your users need to do more precise searches using the "meta"data, e.g. where a sequence is from, which experiment, which domains, specific profiles?

What is fast enough, and how many users? Given that a BLAST run takes about 10 seconds at that scale, how fast do you need to pump data into that system at all? Are you cloud-based or not?

Are you storing sequences or reads? Also, maybe get a consulting contract. If you are going to generate millions of your own sequences, then your lab costs will be significant and something like MarkLogic will fit in your budget (PromethION early access is 135,000 in starting costs...). Especially if you are in the clinical space and will need to deal with legal requirements.


Yes we are still in the biking stage :) Thanks for your pointers.

Just to clarify - We are not working in a clinical setting. The software will not store reads. It is supposed to store protein and gene sequences along with metadata like species, lineage, annotations, length, etc.

Again, thank you ... I will look into Jena TDB and Virtuoso ...

