So I'm here looking for a bit of advice/guidance from those experienced with big-data infrastructure for sequencing data storage. I'm a computer scientist with an undergrad Biology degree, but I'm fairly sure I've forgotten most of it by now, so please correct me if I make any incorrect assumptions below.
The lab I work for is looking into buying an NGS machine for shotgun sequencing of a 'mixed pot' of small protein-coding DNA (~250-300 bp). As I understand it, shotgun sequencing is usually used for high-throughput sequencing of large DNA strands: the strands are broken up into lots of smaller fragments (~300 bp, I've been told), each fragment is sequenced, and the reads are then aligned to reconstruct the whole sequence. Since the lengths we're trying to sequence are already that small, the shotgun technique should let us sequence each protein-coding fragment whole.
We need to store these sequences with metadata about which projects and assays they came from (and, if a sequence is seen in other projects later on, add that link/reference), among other things.
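To make that concrete, here's a minimal sketch of the kind of record I have in mind; the field names are placeholders rather than a settled schema:

```python
from dataclasses import dataclass, field

@dataclass
class SequenceRecord:
    # The read itself; at ~250-300 bp it could double as a lookup key.
    sequence: str
    # Projects/assays this sequence has been observed in; these lists grow
    # as the same sequence turns up in later runs.
    project_ids: list = field(default_factory=list)
    assay_ids: list = field(default_factory=list)
    # Room for run date, instrument, quality scores, etc.
    metadata: dict = field(default_factory=dict)
```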
We will need to query the database for sequence matches when new reads come in from sequencing (both to add a reference to a project when the sequence already exists in the DB, and to find projects that already have that protein), as well as for partial matches in general scientific use, such as looking for the appearance of specific regions of DNA.
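In case it helps frame the question, this is roughly how I picture the two query paths: exact lookups keyed on the full sequence, and partial matches served by a k-mer inverted index. All the names here are hypothetical, and I realise a real system would likely hand the fuzzy side to something like BLAST:

```python
from collections import defaultdict

K = 21  # k-mer length; a common choice, but essentially arbitrary here

exact_index = {}               # full sequence -> record id
kmer_index = defaultdict(set)  # k-mer -> set of record ids

def add_sequence(record_id, seq):
    """Register a read for both exact and partial lookup."""
    exact_index[seq] = record_id
    for i in range(len(seq) - K + 1):
        kmer_index[seq[i:i + K]].add(record_id)

def find_exact(seq):
    """Exact match: has this precise sequence been seen before?"""
    return exact_index.get(seq)

def find_partial(region):
    """Partial match: which records share k-mers with this region of DNA?"""
    hits = defaultdict(int)
    for i in range(len(region) - K + 1):
        for rid in kmer_index.get(region[i:i + K], ()):
            hits[rid] += 1
    # Rank candidates by shared k-mer count; a real pipeline would verify
    # candidates with an actual alignment step afterwards.
    return sorted(hits, key=hits.get, reverse=True)
```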
We've estimated something towards 240 GB of data being produced per week that we need to store.
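For scale, that works out to roughly 240 GB × 52 ≈ 12.5 TB a year, so multi-year retention puts us well into the tens of terabytes.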
From initial research I feel like NoSQL is the way to go. We've recently come to the conclusion that MongoDB isn't suitable due to the relations we would need to store, so I guess HBase is the front runner at the moment.
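For what it's worth, the HBase shape I'm imagining looks something like the sketch below, using the happybase client. The hostname, table, and column-family names are made up (and assumed to be pre-created), and the row-key choice (the sequence itself vs. a hash of it) is exactly the kind of thing I'd love feedback on:

```python
import hashlib

import happybase  # Thrift-based Python client for HBase

connection = happybase.Connection('hbase-host')  # placeholder hostname
table = connection.table('sequences')            # assumed pre-created table

def store_read(seq, project_id, assay_id):
    # Hash the sequence for a fixed-width, evenly distributed row key.
    row_key = hashlib.sha1(seq.encode()).hexdigest().encode()
    table.put(row_key, {
        b'seq:bases': seq.encode(),
        # One column per project/assay, so later sightings of the same
        # sequence just add columns to the existing row.
        b'proj:' + project_id.encode(): b'1',
        b'assay:' + assay_id.encode(): b'1',
    })

def lookup_read(seq):
    # Exact-match lookup: same hash, same row.
    row_key = hashlib.sha1(seq.encode()).hexdigest().encode()
    return table.row(row_key)
```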
All advice welcome.