Hello, I am working on a project to store and manage genome sequence data, specifically microbial genomes. The tool should store several thousand genomes and needs to find and extract sequences (contigs, genes, proteins) quickly. What is the best way to implement this?
I saw this post, but it is 5 years old: Describe Your Architecture: Uniprot
My specific questions are: 1) Which database is suitable for storing sequence metadata (e.g. name, length, species)? It looks like UniProt used Berkeley DB as their main database and indexed some data in Lucene for searching. Is that correct? 2) Where should I store the sequences themselves so I can retrieve them quickly?
It looks like UniProt indexes the metadata in Lucene, but I am not sure how they store the sequences. Also do they use
Any help will be greatly appreciated.