Question: Private database schema for storing NGS results (Fastq, BAM, VCF...)?
12 weeks ago, mikizu wrote:

Good morning,

I was asked to review the literature on how sequencing results are stored, and especially on how databases for NGS results are usually designed. So far I have found more articles about the IT requirements for storing sequencing data than about database designs for that data, which is why I am asking here.

The facility where I am doing my internship is planning to create a database to store sequencing results once they have been generated and analysed (mostly DNA-seq data). We are starting from scratch: for the moment our run data is still stored on our server in its original folders, as are the Fastq files and other files (BAM, FastQC...). In the future we would like to organize it better, because we are planning to perform de novo assembly on our whole collection, and the results (Fastq especially) have to be easily accessible to the facility's researchers through a user-friendly interface. For the moment my only database reference is the NCBI SRA database, but I am not sure its design is really relevant to our goal, which is a private database containing mostly genomes, for only around 6000 microorganism DNA sequences.

Can you recommend any article, book, website or forum post I could read, or any person or institution I could contact, to see how a database for storing Next Generation Sequencing results (Fastq files, information on the sequencing run, information on the species...) is built? What kind of database schema is usually adopted, and which data should be kept as a priority? For example:

  • Should we organize the data by the sample IDs that we created, by species/strain, by run...?
  • We would keep the Fastq files but not the QC files, which could be generated again later. If VCF files are generated, it seems they have to be kept; is this also the case for alignment BAM files? Some people seem to keep them, others don't...
  • Which fields should we keep: sample IDs, run IDs, library layout (paired end...)...? We only have one Illumina NextSeq 550 sequencer, so a database column for the sequencing platform (as in the SRA database), for example, does not seem relevant.
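
To make these questions more concrete, here is a minimal sketch of one possible layout, using Python's built-in sqlite3 module. Every table and column name below (sample, run, library, data_file, layout, file_type, path) is a hypothetical choice made for illustration, not an established standard or the SRA schema. It assumes the files themselves stay on the server and the database only records where they are, so the same data can be looked up by sample, by species or by run without choosing one sort order up front.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE sample (
    sample_id TEXT PRIMARY KEY,   -- in-house sample identifier
    species   TEXT NOT NULL,
    strain    TEXT
);
CREATE TABLE run (
    run_id   TEXT PRIMARY KEY,    -- sequencer run identifier
    run_date TEXT
);
CREATE TABLE library (
    library_id INTEGER PRIMARY KEY,
    sample_id  TEXT NOT NULL REFERENCES sample(sample_id),
    run_id     TEXT NOT NULL REFERENCES run(run_id),
    layout     TEXT NOT NULL CHECK (layout IN ('single', 'paired'))
);
CREATE TABLE data_file (
    file_id    INTEGER PRIMARY KEY,
    library_id INTEGER NOT NULL REFERENCES library(library_id),
    file_type  TEXT NOT NULL CHECK (file_type IN ('fastq', 'bam', 'vcf')),
    path       TEXT NOT NULL UNIQUE  -- location on the server, not the file content
);
"""


def create_db(path=":memory:"):
    """Open a database and create the sketch schema in it."""
    conn = sqlite3.connect(path)
    # SQLite only enforces REFERENCES constraints when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def fastq_paths_for_species(conn, species):
    """Return the paths of all Fastq files recorded for one species, sorted."""
    rows = conn.execute(
        """
        SELECT f.path
        FROM data_file AS f
        JOIN library AS l ON l.library_id = f.library_id
        JOIN sample  AS s ON s.sample_id  = l.sample_id
        WHERE s.species = ? AND f.file_type = 'fastq'
        ORDER BY f.path
        """,
        (species,),
    ).fetchall()
    return [row[0] for row in rows]
```

With a layout like this, a user interface would only need to run queries such as fastq_paths_for_species and hand back file locations. Whether BAM files get rows at all would then be a retention decision rather than a schema decision.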

Don't hesitate to ask if my question isn't clear or if you need more information. I am somewhat confused after a lot of reading that was not directly related to what I am looking for.

Thank you in advance, any tip might be helpful.

12 weeks ago, colindaven (Hannover Medical School) wrote:

You can look for genomics LIMS systems, but this is a very difficult topic. Many LIMS are only maintained internally, are cobbled together from Excel sheets, or do not exist at all.

This is a good one which is actively maintained.

Here's another:


Thank you very much for your answer, I'll have a look !

(reply written 12 weeks ago by mikizu)