Question: Technology advice: NGS shotgun sequencing on small proteins - Data Storage Option? (Hardware + Software)
5.4 years ago by
United Kingdom
jamesy0 wrote:

So I'm here looking for a bit of advice/guidance from those experienced with big data infrastructure for sequencing data storage. I'm a computer scientist with an undergrad Biology degree, but I'm fairly sure I've forgotten most of it by now, so please correct me if I make any incorrect assumptions below.

The lab I work for is looking into buying an NGS machine for shotgun sequencing of a 'mixed pot' of small-protein DNA (~250-300 bp). As I understand it, shotgun analysis is usually used for high-throughput sequencing: larger DNA strands are broken up into lots of smaller ones (~300 bp, I've been told), these are sequenced, and the pieces are then aligned to get an idea of the whole sequence. As we're trying to sequence such small lengths, the shotgun technique should allow us to just sequence the whole protein DNA.

We need to store these sequences with metadata recording which projects and assays they came from (adding a link/reference if they are seen in other projects later on), among other things.

We will need to query the database for sequence matches when we get new reads in from sequencing: exact matches (for just adding a new reference to a project if the sequence already exists in the db, as well as for finding projects that already have that protein) and partial matches, for general scientific queries about the appearance of specific regions of DNA.

We've estimated that something towards 240 GB of data a week will be produced that we need to store.

From initial research I feel like NoSQL is the way to go. We've recently come to the conclusion that MongoDB isn't suitable due to the relations we would need to store, and I guess HBase is the front runner at the moment.

All advice welcome.

ADD COMMENTlink modified 5.4 years ago by Sean Davis26k • written 5.4 years ago by jamesy0
5.4 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

I think you should talk to a group already doing this, as I think you are discussing setting up an NGS system?  If so, you risk reinventing wheels without some direct interaction with folks doing this already.  Since you are in the UK, there are likely a number of groups who would be willing to talk with you and make suggestions.

Just FYI, in my experience, databases have no place for storing "the data", only for storing metadata.

ADD COMMENTlink written 5.4 years ago by Sean Davis26k

And also, in general, this. I am sure there are plenty of people even at your institution with experience in NGS that you can talk to, and you definitely should. What sequencer you are looking to get, how often you intend to run it, and how long you want to keep data around for are all very important questions. You said 240 GB a week? What sequencer were you looking at? With a single MiSeq you might have trouble generating that much data per week once run time is factored in, and that number seems low to me for a HiSeq run.


There are also plenty of excellent talks online to do with computation and data storage around NGS. Most are oriented towards people running what amounts to a core facility, but if you are really going to generate 240GB of raw data to be kept per week, plus secondary analysis data, you're going to want to think bigger in terms of storage. BioTeam has quite a few out there you might find worth reading on slideshare.

ADD REPLYlink written 5.4 years ago by DG7.2k

Thanks. We're a small bioengineering company, so when I said lab, I guess I meant company-wide: it's the first one we'll have. The scientists haven't been very concrete in providing all the requirements yet; those are just the numbers they provided, which may indeed change. The project cycles are around 2 weeks, and other parts of the process limit how many projects we can push through, so that may very well limit HiSeq data if that's what we're getting.

We've been searching hard. I've found a lot of old posts on here that link to resources from BioTeam that were enlightening, but a lot of the posts on Biostar with a relevant topic were 4 years old or so, so I'm just hoping for someone who has had some more recent experience, possibly with implementing Hadoop clusters or HBase with MapReduce algorithms for parallel data computing.

In response to Sean, I don't personally have any contacts outside this lab in this field, but I've asked the scientists involved to enquire with our partner labs/universities etc.  

I'll try to update the question when we get more of a spec out of the scientists.

ADD REPLYlink written 5.4 years ago by jamesy0

If your company wants to do this, and they don't have a bioinformatics person with NGS experience, they REALLY, REALLY should. Don't get lost in the weeds of "Big Data." While there has been some work adapting "big data" systems and programming to genomics, it hasn't been that successful for a variety of reasons (some worthwhile reasons, others mostly to do with scientific inertia compared to web 2.0 company strategy). HBase and Hadoop aren't widely used in genomics right now, and most of the gold standard programs and pipelines people use don't take advantage of that architecture. They are all set up for running on fairly standard HPC setups. If you are interested in exploring that sort of setup you may want to check out something like Mesos, which lets you run multiple architectures on a cluster. You'll certainly find more expertise with HBase, Hadoop, etc. on the web IT side of things than in bioinformatics.

It's also worth pointing out that BioTeam is a consulting shop. Again, if you lack the in-house expertise, it is worth bringing in that expertise either through hiring or some contract consulting.

ADD REPLYlink written 5.4 years ago by DG7.2k

Don't try to separate the bioinformatics processing from the IT infrastructure. Hadoop and HBase are not really relevant for standard NGS data processing. That isn't to say that they shouldn't be, but you'll end up using open source tools for most of your analysis of NGS data, and those are largely designed to run on commodity hardware and not use Hadoop/MapReduce. As Dan points out, don't get lost in the weeds of "Big Data".

ADD REPLYlink modified 13 months ago by _r_am31k • written 5.4 years ago by Sean Davis26k

What would you call standard NGS data processing? (I haven't had any experience with it at all.) Our use of NGS is for rapid sequencing of protein DNA to add data to our current process (as I understand it, it isn't really vital; it's more for analysing general trends in the business process [e.g. experiments tend to be sh** when these sequences are involved, etc.], but hey, I'm not the scientist here).

As far as I'm concerned, and have been informed, we'll need to store ~20,000,000 new protein sequences (~98 amino acids each) a month. That's probably in 20 lots of 1,000,000 (20 projects).
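A rough back-of-the-envelope check on those numbers may help frame the scale. This is only a sketch: it assumes one byte per amino acid, no compression, no index or per-row overhead, and that every sequence turns out to be unique. None of those assumptions come from the scientists.

```python
# Rough storage estimate for the protein sequences alone.
# Assumptions (illustrative only): 1 byte per residue, no compression,
# no index or per-row database overhead, every sequence unique.
SEQUENCES_PER_MONTH = 20_000_000
RESIDUES_PER_SEQUENCE = 98

bytes_per_month = SEQUENCES_PER_MONTH * RESIDUES_PER_SEQUENCE
gb_per_month = bytes_per_month / 1e9
gb_per_year = gb_per_month * 12

print(f"{gb_per_month:.2f} GB/month, {gb_per_year:.2f} GB/year")
```

Under these assumptions the unique protein sequences come to about 2 GB a month, which is small next to the ~240 GB a week of sequencer output, so the raw reads and the sequence catalogue may well deserve separate storage decisions.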

Every time a sequence is entered into the database, we'll need to scan/query the existing sequences so that only unique ones are entered and stored, and so that already existing sequences get a new reference/relation to the project that's just been read in.
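The "insert only if new, otherwise just add a project reference" step can be sketched without scanning every stored sequence by keying each one on a hash. The class below is a hypothetical in-memory stand-in for whatever database is eventually chosen; `SequenceCatalogue`, its method names, and the choice of SHA-256 are all assumptions made for illustration, and it only handles exact matches.

```python
import hashlib


class SequenceCatalogue:
    """Hypothetical in-memory stand-in: unique sequences plus project links."""

    def __init__(self):
        self.sequences = {}  # digest -> sequence
        self.projects = {}   # digest -> set of project ids

    @staticmethod
    def _key(sequence):
        # Normalise case so 'mkv' and 'MKV' count as the same sequence.
        return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()

    def add(self, sequence, project_id):
        """Store the sequence if it is new; always record the project link.

        Returns True if the sequence had not been seen before.
        """
        key = self._key(sequence)
        is_new = key not in self.sequences
        if is_new:
            self.sequences[key] = sequence.upper()
        self.projects.setdefault(key, set()).add(project_id)
        return is_new


catalogue = SequenceCatalogue()
first = catalogue.add("MKVLAAGIV", "project_A")   # new sequence
second = catalogue.add("MKVLAAGIV", "project_B")  # already stored, link only
```

The lookup is a single key check rather than a scan, which is the property that matters as the number of stored sequences grows; a unique index in a database gives the same behaviour.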

I'm just assuming this counts as "big data". What I do need is decent query/scan access and a database that will scale well as more sequences are added. The reason I mentioned Hadoop/MapReduce before is the parallel processing it provides fairly easily (?), and therefore the reduced search times.

Other general uses of the database are retrieving which sequences have appeared in a project, and which other projects a sequence has appeared in.
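Both lookups are the two directions of a many-to-many relation between projects and sequences. As a minimal sketch, here is one way that could look using Python's built-in `sqlite3` module; the table names, column names, and helper functions are hypothetical, and SQLite is used only because it needs no setup, not as a recommendation for this scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sequence (
        id INTEGER PRIMARY KEY,
        residues TEXT NOT NULL UNIQUE
    );
    CREATE TABLE project (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- Junction table: one row per (project, sequence) observation.
    CREATE TABLE project_sequence (
        project_id INTEGER NOT NULL REFERENCES project(id),
        sequence_id INTEGER NOT NULL REFERENCES sequence(id),
        PRIMARY KEY (project_id, sequence_id)
    );
""")


def link(conn, project_name, residues):
    """Record that a sequence was seen in a project, creating rows as needed."""
    conn.execute("INSERT OR IGNORE INTO project (name) VALUES (?)", (project_name,))
    conn.execute("INSERT OR IGNORE INTO sequence (residues) VALUES (?)", (residues,))
    conn.execute(
        """INSERT OR IGNORE INTO project_sequence (project_id, sequence_id)
           SELECT p.id, s.id FROM project p, sequence s
           WHERE p.name = ? AND s.residues = ?""",
        (project_name, residues),
    )


def sequences_in_project(conn, project_name):
    rows = conn.execute(
        """SELECT s.residues FROM sequence s
           JOIN project_sequence ps ON ps.sequence_id = s.id
           JOIN project p ON p.id = ps.project_id
           WHERE p.name = ? ORDER BY s.residues""",
        (project_name,),
    )
    return [row[0] for row in rows]


def projects_with_sequence(conn, residues):
    rows = conn.execute(
        """SELECT p.name FROM project p
           JOIN project_sequence ps ON ps.project_id = p.id
           JOIN sequence s ON s.id = ps.sequence_id
           WHERE s.residues = ? ORDER BY p.name""",
        (residues,),
    )
    return [row[0] for row in rows]


link(conn, "project_A", "MKVLAAGIV")
link(conn, "project_B", "MKVLAAGIV")
link(conn, "project_B", "MSTNPKPQR")

in_b = sequences_in_project(conn, "project_B")
with_seq = projects_with_sequence(conn, "MKVLAAGIV")
```

The sequence itself is stored once, and each appearance in a project is just one small row in the junction table.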


Not sure if this clears anything up?  Thanks for your feedback though, it's made me think.  And we're reaching out to a lab or two.

ADD REPLYlink written 5.4 years ago by jamesy0

Standard NGS data processing depends a little bit on your application. Microbiology and novel genomes are different from resequencing in model organisms, for instance. In your workflow, is the DNA coming from a single organism or different organisms? Are they always organisms with reference genomes, or all from unsequenced species? Maybe a mixture of both? What tools will go from DNA to predicted amino acid sequence? How do you plan on doing sequence comparisons? How will you handle inexact matching, allowing for sequencing error? etc. These are all very important questions that need to be addressed before you decide on infrastructure and architecture. It may turn out to be more efficient to do something 'standard' (e.g. BLAST) than to use fancy big data analytics methods. But I don't fault you for your interest in them: there are projects out there looking to do more of this, and it would be good for more bioinformatics tools to make use of these frameworks rather than re-implement (badly) some of the same techniques, especially when it comes to data locality of computations, scale, and parallelization.
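To make the inexact-matching question concrete, here is a small sketch showing how a single sequencing error defeats an exact-match uniqueness check. The function names and the one-mismatch tolerance are assumptions for illustration; it only counts substitutions between equal-length sequences and ignores insertions and deletions, which alignment tools such as BLAST are designed to handle.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(a, b, max_mismatches=1):
    """True if the sequences have equal length and differ in at most
    max_mismatches positions (substitutions only, no indels)."""
    return len(a) == len(b) and hamming_distance(a, b) <= max_mismatches


reference = "MKVLAAGIV"
read_with_error = "MKVLAAGLV"  # one residue differs from the reference

exact = reference == read_with_error
near = is_near_duplicate(reference, read_with_error)
print(exact, near)
```

An exact comparison treats the two as different sequences, so each error-containing read would be stored as a new "unique" entry unless some tolerance like this is part of the design.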

ADD REPLYlink written 5.4 years ago by DG7.2k

Just to comment, Hadoop and other Big Data solutions are actually designed to run on commodity hardware, which is why they scale and CAN scale to large numbers of nodes inexpensively. This is contrasted with traditional enterprise IT, which is obnoxiously expensive and not very scalable. It is also contrasted with traditional HPC, which is not typically considered commodity hardware either, and most architectures over a certain size get quite complex.


However, things are starting to converge. I am currently looking at building a small cluster that shares elements with both small-scale HPC and small-scale "big data".

ADD REPLYlink written 5.4 years ago by DG7.2k

Totally agree with your comments.  I just wanted to point out that many "big data" solutions that employ Hadoop and the MapReduce paradigm haven't found much use in genomics yet.  As we move toward more performant systems such as Spark on machines that can easily have 1/4 to 1/2 TB of RAM, that may change.

ADD REPLYlink written 5.4 years ago by Sean Davis26k

I agree, although I think it is mostly an artifact of siloing and specialization. All of the early NGS bioinformatics people were coming from HPC backgrounds.

ADD REPLYlink written 5.4 years ago by DG7.2k

You could theoretically store the raw data in the database. This hasn't been done a lot with NGS data because downstream tools can't work with it directly, but you could certainly do it if you wanted.

ADD REPLYlink written 5.4 years ago by DG7.2k