Question: Sequence Data Management
9 weeks ago, navela78 wrote:

Hello, I am working on a project to store and manage genome sequence data, specifically microbial genomes. The tool should store several thousand genomes and needs to find and extract sequences (contigs, genes, proteins) quickly. What is the best way to implement this?

I saw this post but it is 5 years old: Describe Your Architecture: Uniprot

My specific questions are: 1) Which database is suitable for storing sequence metadata (e.g. name, length, species, etc.)? It looks like UniProt used Berkeley DB as their main database and indexed some data in Lucene for searching. Is that correct? 2) Where should I store the sequence itself so that I can retrieve it quickly?

It looks like UniProt indexes the metadata in Lucene, but I am not sure how they store sequences ... Also do they use

Any help will be greatly appreciated.

genome • 191 views
modified 2 days ago by Biostar • written 9 weeks ago by navela78

1) Which database is suitable to store sequence metadata (e.g. name, length, species, etc.)

It depends on the amount of metadata, the number of sequences, the complexity ...

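-- the table name "sequence" is assumed; the original statement omitted it
create table species(id integer primary key, name text);
create table sequence(id integer primary key, name text, bases text, species_id integer references species(id));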

would work...
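A minimal sketch of how the two-table layout above could be exercised, using Python's built-in `sqlite3` module purely for illustration. The table name `sequence`, the index name, the helper `sequences_for_species`, and the sample rows are all assumptions, not part of the original answer.

```python
import sqlite3

# In-memory database for illustration only; a production setup would
# likely use a server database instead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE species (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE sequence (            -- table name is an assumption
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        bases      TEXT NOT NULL,
        species_id INTEGER NOT NULL REFERENCES species(id)
    );
    -- Index the foreign key so per-species lookups avoid a full table scan.
    CREATE INDEX idx_sequence_species ON sequence(species_id);
""")

# Made-up sample rows.
conn.execute("INSERT INTO species (id, name) VALUES (1, 'Streptococcus pneumoniae')")
conn.executemany(
    "INSERT INTO sequence (name, bases, species_id) VALUES (?, ?, ?)",
    [("contig_1", "ATGCGTAC", 1), ("contig_2", "GGCATT", 1)],
)

def sequences_for_species(conn, species_name):
    """Return (name, length) for every sequence of the given species."""
    rows = conn.execute(
        """
        SELECT s.name, LENGTH(s.bases)
        FROM sequence AS s
        JOIN species AS sp ON sp.id = s.species_id
        WHERE sp.name = ?
        ORDER BY s.name
        """,
        (species_name,),
    )
    return rows.fetchall()
```

Calling `sequences_for_species(conn, "Streptococcus pneumoniae")` returns the name and length of each stored contig for that species.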

written 9 weeks ago by Pierre Lindenbaum

Thanks for your response Pierre.

The amount of metadata is large: on the order of hundreds of millions of records. The complexity will be low. It will hold simple things like name, length, start coordinate, end coordinate, and unique ID. The only complex thing that may be in the metadata is the taxonomic lineage of the sequence, e.g. (Streptococcus -> Firmicutes -> Streptococcaceae -> ...)

Which database can handle this much data? When you say bases, are you suggesting that I store the sequence bases in the database itself (as opposed to a flat file)? Database systems are usually not tuned to store such large sequences, correct? Imagine a bacterial chromosome ~5 MB in length ... is it good practice to store it in a database?
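For the taxonomic lineage mentioned above, one common relational pattern is a parent-pointer table walked with a recursive query. The sketch below is only an illustration under assumptions: the `taxon` table, its columns, and the `lineage` helper are hypothetical, and the three taxa are arranged genus -> family -> phylum with intermediate ranks omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE taxon (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES taxon(id)   -- NULL for the root
    )
""")
# Simplified lineage: intermediate ranks are left out on purpose.
conn.executemany(
    "INSERT INTO taxon (id, name, parent_id) VALUES (?, ?, ?)",
    [
        (1, "Firmicutes", None),
        (2, "Streptococcaceae", 1),
        (3, "Streptococcus", 2),
    ],
)

def lineage(conn, taxon_id):
    """Return taxon names from the given taxon up to the root."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, name, parent_id, depth) AS (
            SELECT id, name, parent_id, 0 FROM taxon WHERE id = ?
            UNION ALL
            SELECT t.id, t.name, t.parent_id, c.depth + 1
            FROM taxon AS t
            JOIN chain AS c ON t.id = c.parent_id
        )
        SELECT name FROM chain ORDER BY depth
        """,
        (taxon_id,),
    )
    return [name for (name,) in rows]
```

With this layout each sequence row only needs a single taxon ID, and the full lineage is derived on demand rather than repeated in hundreds of millions of metadata rows.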

written 9 weeks ago by navela78

Do not create an answer when replying to a comment or answer. This makes the questions appear as answered. Use the "add reply" button instead.

written 9 weeks ago by Jean-Karim Heriche

Three points to consider:
1- MySQL/MariaDB and PostgreSQL can both handle tables with millions of rows. I have a MySQL table with ~400 million rows. To get good performance you may need to tune the configuration, and it may also be beneficial to split the data according to your access pattern(s). The key to making a database useful is to design and index it properly.
2- You can keep sequences in files and store only the paths to the files in the database. For many applications this is more convenient, but it depends on what the downstream applications are.
3- If this is going to be a resource used regularly, you should consider building an API.
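Point 2 can be sketched as follows: the sequences stay in a flat file, and the database holds only the path plus the byte offset and length of each record, so a sub-sequence can be read with a single seek. This is a rough illustration, not a definitive implementation. The `seq_index` table and the functions `build_index` and `fetch` are hypothetical names, and the code assumes a simplified FASTA layout with exactly one sequence line per record.

```python
import os
import sqlite3
import tempfile

def build_index(conn, fasta_path):
    """Record the byte offset and length of each sequence line in the file.

    Assumes a simplified FASTA layout: one header line ('>id ...') followed
    by exactly one sequence line per record.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seq_index ("
        "seq_id TEXT PRIMARY KEY, path TEXT, offset INTEGER, length INTEGER)"
    )
    with open(fasta_path, "rb") as fh:
        seq_id = None
        while True:
            offset = fh.tell()          # byte position of the line about to be read
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                seq_id = line[1:].split()[0].decode("ascii")
            elif seq_id is not None:
                conn.execute(
                    "INSERT INTO seq_index VALUES (?, ?, ?, ?)",
                    (seq_id, fasta_path, offset, len(line.rstrip(b"\r\n"))),
                )
                seq_id = None
    conn.commit()

def fetch(conn, seq_id, start=0, end=None):
    """Read bases [start, end) of one sequence without loading the whole file."""
    row = conn.execute(
        "SELECT path, offset, length FROM seq_index WHERE seq_id = ?", (seq_id,)
    ).fetchone()
    if row is None:
        raise KeyError(seq_id)
    path, offset, length = row
    end = length if end is None else min(end, length)
    start = max(0, min(start, end))
    with open(path, "rb") as fh:
        fh.seek(offset + start)
        return fh.read(end - start).decode("ascii")

# Small demonstration with a temporary file and made-up sequences.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.fasta")
with open(demo_path, "wb") as out:
    out.write(b">contig_1 example\nATGCGTACGT\n>contig_2\nGGCATT\n")

conn = sqlite3.connect(":memory:")
build_index(conn, demo_path)
```

Here `fetch(conn, "contig_1", 2, 5)` reads only three bytes from disk. The same idea underlies the `.fai` index produced by `samtools faidx`, which additionally handles line-wrapped sequences; a function like `fetch` is also a natural building block for the API suggested in point 3.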

modified 9 weeks ago • written 9 weeks ago by Jean-Karim Heriche

While it's not a database, I've had success using HDF5 for various projects.

written 2 days ago by Eric Lim