Question: Sequence Data Management
13 months ago, navela78 (50) wrote:

Hello, I am working on a project to store and manage genome sequence data, specifically microbial genomes. The tool should store several thousand genomes and needs to find and extract sequences (contigs, genes, proteins) quickly. What is the best way to implement this?

I saw this post but it is 5 years old: Describe Your Architecture: Uniprot

My specific questions are: 1) Which database is suitable for storing sequence metadata (e.g. name, length, species, etc.)? It looks like UniProt used Berkeley DB as their main database and indexed some data in Lucene for searching. Is that correct? 2) Where should I store the sequence itself so I can retrieve it quickly?

It looks like UniProt indexes the metadata in Lucene, but I am not sure how they store sequences ... Also do they use

Any help will be greatly appreciated.

modified 11 months ago by Biostar (20) • written 13 months ago by navela78 (50)

1) Which database is suitable to store sequence metadata (e.g. name, length, species, etc.)

It depends on the amount of metadata, the number of sequences, the complexity ...

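-- table name "sequence" is assumed (not given in the original post)
CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sequence (id INTEGER PRIMARY KEY, name TEXT, bases TEXT, species_id INTEGER REFERENCES species(id));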

would work...
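
For illustration, here is a minimal sketch of how the two-table layout above could be exercised. It assumes SQLite through Python's standard sqlite3 module purely for convenience; the table name sequence, the index name, the helper functions, and the sample rows are all hypothetical and not part of the original suggestion.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with the two-table layout and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        -- "sequence" is an assumed table name; the post did not give one.
        CREATE TABLE sequence (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            bases TEXT NOT NULL,
            species_id INTEGER NOT NULL REFERENCES species(id)
        );
        -- Index the foreign key so lookups by species avoid a full table scan.
        CREATE INDEX idx_sequence_species ON sequence(species_id);
        """
    )
    # Made-up sample data, only there so the query below returns something.
    conn.execute(
        "INSERT INTO species (id, name) VALUES (1, 'Streptococcus pneumoniae')"
    )
    conn.executemany(
        "INSERT INTO sequence (name, bases, species_id) VALUES (?, ?, ?)",
        [("contig_1", "ACGTACGT", 1), ("contig_2", "GGCC", 1)],
    )
    return conn


def sequences_for_species(conn, species_name):
    """Return (sequence name, length) pairs for one species, sorted by name."""
    rows = conn.execute(
        "SELECT s.name, length(s.bases) FROM sequence AS s "
        "JOIN species AS sp ON sp.id = s.species_id "
        "WHERE sp.name = ? ORDER BY s.name",
        (species_name,),
    )
    return rows.fetchall()
```

Calling sequences_for_species(build_demo_db(), "Streptococcus pneumoniae") returns the two sample contigs with their lengths. The same statements should carry over to MySQL or PostgreSQL with minor type changes.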

written 13 months ago by Pierre Lindenbaum (114k)

Thanks for your response Pierre.

The number of metadata records is large, on the order of hundreds of millions. The complexity will be low: simple things like name, length, start coordinate, end coordinate, and unique ID. The only complex thing that may be in the metadata is the taxonomic lineage of each sequence, e.g. (Streptococcus -> Firmicutes -> Streptococcaceae -> ...)

Which db can handle such large data? When you say bases, are you suggesting that I store the sequence bases in the database itself (as opposed to a flat file)? Usually database systems are not tuned to store such large sequences, correct? Imagine a bacterial chromosome of ~5MB length ... is it good practice to store it in a database?
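
One possible way to hold a lineage like the one mentioned above in a relational database is a parent-pointer table walked with a recursive query. The sketch below is only an illustration under assumptions: it uses SQLite via Python's sqlite3 module, the table name taxon and both function names are hypothetical, and intermediate ranks are left out for brevity.

```python
import sqlite3


def build_taxonomy():
    """Create a tiny parent-pointer taxonomy table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
        "parent_id INTEGER REFERENCES taxon(id))"
    )
    # Each row points at its parent; intermediate ranks are omitted here.
    conn.executemany(
        "INSERT INTO taxon (id, name, parent_id) VALUES (?, ?, ?)",
        [
            (1, "Firmicutes", None),
            (2, "Streptococcaceae", 1),
            (3, "Streptococcus", 2),
        ],
    )
    return conn


def lineage(conn, taxon_id):
    """Return taxon names from the given node up to the root."""
    rows = conn.execute(
        """
        WITH RECURSIVE walk(id, name, parent_id, depth) AS (
            SELECT id, name, parent_id, 0 FROM taxon WHERE id = ?
            UNION ALL
            SELECT t.id, t.name, t.parent_id, walk.depth + 1
            FROM taxon AS t JOIN walk ON t.id = walk.parent_id
        )
        SELECT name FROM walk ORDER BY depth
        """,
        (taxon_id,),
    )
    return [row[0] for row in rows]
```

With this layout each sequence row only needs a single taxon id, and the full lineage is reconstructed on demand rather than repeated in every metadata record.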

written 13 months ago by navela78 (50)

Do not create an answer when replying to a comment or answer. This makes the questions appear as answered. Use the "add reply" button instead.

written 13 months ago by Jean-Karim Heriche (16k)

Three points to consider:
1- MySQL/MariaDB and PostgreSQL can both handle tables with millions of rows. I've a MySQL table with ~400 million rows. To get good performance you may need to tune the configuration and it may also be beneficial to split the data according to your access pattern(s). The key to making a database useful is to design and index it properly.
2- You can keep sequences in files and only store the paths to the files in the database. For many applications this is more convenient but this depends on what the downstream applications are.
3- If this is going to be a resource used regularly, you should consider building an API.
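
To make point 2 concrete, here is a minimal sketch of keeping sequences in a FASTA file while the database stores only the file path and byte range of each record. It assumes well-formed FASTA input and SQLite via Python's sqlite3 module; the table seq_location, its columns, and both function names are hypothetical.

```python
import sqlite3


def index_fasta(conn, fasta_path):
    """Store, for each FASTA record, where its sequence bytes live on disk."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seq_location ("
        "seq_id TEXT PRIMARY KEY, path TEXT NOT NULL, "
        "byte_offset INTEGER NOT NULL, n_bytes INTEGER NOT NULL)"
    )
    insert = "INSERT INTO seq_location VALUES (?, ?, ?, ?)"
    with open(fasta_path, "rb") as fh:
        seq_id, start, pos = None, 0, 0
        for line in fh:
            if line.startswith(b">"):
                # A new header closes the previous record, if there was one.
                if seq_id is not None:
                    conn.execute(insert, (seq_id, fasta_path, start, pos - start))
                # The record id is the first word after ">".
                seq_id = line[1:].split()[0].decode()
                start = pos + len(line)
            pos += len(line)
        # Close the last record in the file.
        if seq_id is not None:
            conn.execute(insert, (seq_id, fasta_path, start, pos - start))
    conn.commit()


def fetch_sequence(conn, seq_id):
    """Seek straight to one record and return its bases without line breaks."""
    row = conn.execute(
        "SELECT path, byte_offset, n_bytes FROM seq_location WHERE seq_id = ?",
        (seq_id,),
    ).fetchone()
    if row is None:
        raise KeyError(seq_id)
    path, byte_offset, n_bytes = row
    with open(path, "rb") as fh:
        fh.seek(byte_offset)
        raw = fh.read(n_bytes)
    # Drop newlines so the caller gets one contiguous string.
    return b"".join(raw.split()).decode()
```

The database stays small because it holds only locations, and retrieving one contig costs a single indexed lookup plus one seek, regardless of file size.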

modified 13 months ago • written 13 months ago by Jean-Karim Heriche (16k)

While it's not a database, I've had success using HDF5 for various projects.

written 11 months ago by Eric Lim (1.1k)