I was asked to survey the literature on how sequencing results are stored, and in particular on how databases for NGS results are usually designed. So far I have found more articles about the IT requirements for storing sequencing data than about database designs for that data, which is why I am asking here.
The facility where I am doing my internship is planning to create a database to store sequencing results once they have been generated and analysed (mostly DNA-seq data). We are starting from scratch: for the moment our run data are still stored on our server in their original folders, and the same goes for the Fastq files and other files (BAM, FastQC...). In the future we would like to organize this better, because we plan to perform de novo assembly on our whole collection, and the results (Fastq files especially) have to be easily accessible to the facility's researchers through a user-friendly interface. For the moment my only database reference is the NCBI SRA database, but I am not sure its design is really relevant to our goal, which is a private database containing mostly genomes, covering only around 6000 microorganism DNA sequences.
Do you have any reference (article, book, website, forum post) I could read, or any person or institution I could contact, to see how a database for storing Next Generation Sequencing results (Fastq files, information on the sequencing run, information on the species...) is constructed? What kind of database schema is usually adopted, and which data are kept as a priority? For example:
- Should we organize the data by the sample IDs that we created, by species/strain, by run...?
- We would keep the Fastq files but not the QC files, which could be generated again later. If VCF files are generated, it seems they have to be kept; is this also the case for alignment BAM files? Some people seem to keep them, others don't...
- What kind of fields should we keep: sample IDs, run IDs, library layout (paired-end...)...? We only have one Illumina NextSeq 550 sequencer, so it does not seem relevant to add a database column for the sequencing platform, for example (as in the SRA database).
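To make the question more concrete, here is a rough sketch of the kind of structure I imagine, written with Python's built-in `sqlite3` module. Every table and column name in it (`sample`, `run`, `library`, `data_file`, `layout`...) is my own assumption and not taken from SRA or any existing system; it only shows samples, runs and files as separate linked tables, and I do not know whether this is how it is usually done.

```python
import sqlite3

# Hypothetical schema: all table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE sample (
    sample_id TEXT PRIMARY KEY,          -- our in-house sample ID
    species   TEXT NOT NULL,
    strain    TEXT
);
CREATE TABLE run (
    run_id   TEXT PRIMARY KEY,           -- sequencing run identifier
    run_date TEXT
);
CREATE TABLE library (
    library_id INTEGER PRIMARY KEY,
    sample_id  TEXT NOT NULL REFERENCES sample(sample_id),
    run_id     TEXT NOT NULL REFERENCES run(run_id),
    layout     TEXT NOT NULL CHECK (layout IN ('single', 'paired'))
);
CREATE TABLE data_file (
    file_id    INTEGER PRIMARY KEY,
    library_id INTEGER NOT NULL REFERENCES library(library_id),
    file_type  TEXT NOT NULL,            -- e.g. 'fastq', 'bam', 'vcf'
    path       TEXT NOT NULL UNIQUE      -- files stay on the server; only the path is stored
);
"""


def build_db(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
    conn.executescript(SCHEMA)
    return conn


def fastq_paths_for_species(conn, species):
    """Return the paths of all Fastq files linked to samples of one species, sorted."""
    rows = conn.execute(
        """
        SELECT f.path
        FROM data_file AS f
        JOIN library AS l ON l.library_id = f.library_id
        JOIN sample  AS s ON s.sample_id = l.sample_id
        WHERE s.species = ? AND f.file_type = 'fastq'
        ORDER BY f.path
        """,
        (species,),
    )
    return [row[0] for row in rows]
```

With something like this, the data would not be "sorted" by sample, species or run at all: the files would stay where they are on the server, and a query such as `fastq_paths_for_species(conn, "Escherichia coli")` would look them up by whichever criterion is needed. I would be glad to hear whether this matches common practice.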
Don't hesitate to ask if my question isn't clear or if you need more information. I am somewhat confused after a lot of reading that was not directly related to what I am looking for.
Thank you in advance, any tip might be helpful.