Question: NGS data storage and management
21 months ago by
olavur80
Tórshavn, Faroe Islands
olavur80 wrote:

I'm looking for a system to store human NGS data and metadata, and to retrieve data. We have a storage server with a proper distributed filesystem (Isilon OneFS).

There are some other posts discussing this topic, for example:

But I wanted to make a new post because (1) those posts are several years old, and I imagine practices are different today, and (2) they discuss file formats and distributed file systems a lot, while I'm more interested in ways to access data.

I would like to have a system, preferably with a GUI (browser is also fine), where I can search for an individual (pseudonym ID), and retrieve their data:

  • Raw NGS data (FASTQ)
  • Aligned reads (BAM)
  • Variants (VCF)
  • Metadata, for example whether the individual is part of a trio, whether the individual was sequenced more than once, how the individual was sequenced, etc.

I also want to be able to retrieve data (VCF or BAM or whatever is specified) from a list of individual IDs.
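To make the lookup requirement concrete, here is a minimal sketch of a file catalog in SQLite that maps pseudonym IDs to metadata and to file paths on the storage server. Everything in it is an assumption for illustration: the table and column names, the `FO-…` IDs, and the `/ngs/...` paths are hypothetical, and a real system would add access control and a front end on top.

```python
import sqlite3

def create_catalog(conn):
    """Create two tables: one row per individual, one row per stored file."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS individual (
            pseudonym_id    TEXT PRIMARY KEY,
            trio_id         TEXT,     -- NULL if not part of a trio
            times_sequenced INTEGER,
            method          TEXT      -- e.g. 'WGS' or 'exome'
        );
        CREATE TABLE IF NOT EXISTS data_file (
            pseudonym_id TEXT NOT NULL REFERENCES individual(pseudonym_id),
            file_type    TEXT NOT NULL CHECK (file_type IN ('FASTQ', 'BAM', 'VCF')),
            path         TEXT NOT NULL
        );
    """)

def files_for(conn, ids, file_type):
    """Return (pseudonym_id, path) pairs of the given type for a list of IDs."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    query = (
        "SELECT pseudonym_id, path FROM data_file "
        f"WHERE file_type = ? AND pseudonym_id IN ({placeholders}) "
        "ORDER BY pseudonym_id, path"
    )
    return conn.execute(query, [file_type, *ids]).fetchall()

# Hypothetical example data, held in memory only.
conn = sqlite3.connect(":memory:")
create_catalog(conn)
conn.executemany("INSERT INTO individual VALUES (?, ?, ?, ?)", [
    ("FO-0001", "TRIO-01", 2, "WGS"),
    ("FO-0002", None, 1, "exome"),
])
conn.executemany("INSERT INTO data_file VALUES (?, ?, ?)", [
    ("FO-0001", "VCF", "/ngs/vcf/FO-0001.vcf.gz"),
    ("FO-0001", "BAM", "/ngs/bam/FO-0001.bam"),
    ("FO-0002", "VCF", "/ngs/vcf/FO-0002.vcf.gz"),
])
print(files_for(conn, ["FO-0001", "FO-0002"], "VCF"))
```

The catalog stores only paths, so the FASTQ, BAM and VCF files themselves stay where they are on the filesystem.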

Some nice-to-haves:

  • Retrieve variants from individual lists in specified gene(s), loci or type of variation.
  • Incorporating genome browsers such as ExAC.
  • Or a different kind of genome browser like IGV.
  • Familial relationships, for example as in Family Genome Browser (FGB)

Some examples of software I am unsure of:

Any input on this topic would be greatly appreciated.

modified 21 months ago • written 21 months ago by olavur80

Are you willing to purchase or looking for a free solution?

modified 21 months ago • written 21 months ago by genomax68k

Purchasing is an option.

written 21 months ago by olavur80

While a system of this type sounds simple, getting freeware or an off-the-shelf commercial solution to fit your internal business practices can easily become a huge pain in the you know what. Most times this is because of unwillingness of locals to change their business practices/inability to map existing practices onto a ready-made solution. This is guaranteed to cause pain for many unless you have plenty of resources (i.e. developers) to throw at this.

Looking at your user profile you seem to be at an institution that is in this for the long term. So if you have internal developer resources, then putting a solution together that fits your needs (keeping very simple/realistic goals, which is extremely important) may prove to be the best solution.

Also take a look at this old thread: Is there a Lims that doesn't suck? Issues mentioned in that thread (unfortunately) remain current. But it does have useful information about various packages.

modified 21 months ago • written 21 months ago by genomax68k

"Most times this is because of unwillingness of locals to change their business practices/inability to map existing practices onto a ready-made solution."

We are sort of building everything from the ground up, so I don't know how much we have to adapt to existing practices.

"This is guaranteed to cause pain for many unless you have plenty of resources (i.e. developers) to throw at this."

We don't have a lot of resources, so we can't expect to develop a complex system for ourselves.

"So if you have internal developer resources, then putting a solution together that fits your needs (keeping very simple/realistic goals, which is extremely important) may prove to be the best solution."

If there is no suitable system we can buy, then perhaps we need to consider developing our own. But most likely I will have to do it myself, so it will have to be very simple.

written 21 months ago by olavur80

I think you should separate raw data from accessible information. The raw data (BAM files, for instance) can be stored on a slow file system. The accessible data would be the SNPs, coverage, etc.; it should be pre-computed, or computed on request, and then loaded into the database. In my opinion there is no reason to keep BAM files readily accessible. You do have to narrow your analysis down to pre-defined questions, or allow some time to generate the relevant data on request, but the saving in fast storage is huge. If you are looking for a commercial solution you can check out SQREAM; they have (or at least had) dedicated solutions for systems like the one you describe. Good luck.
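As a rough sketch of the "pre-compute and load into the database" idea, the following reads variant records from single-sample VCF text, loads them into an SQLite table, and answers a region query across individuals. The table layout, function names, sample IDs and example records are all hypothetical, and the sketch assumes every record in a single-sample VCF is a call for that individual (it does not inspect genotypes).

```python
import sqlite3

def create_variant_table(conn):
    """One row per individual per alternate allele, indexed by locus."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS variant (
            pseudonym_id TEXT NOT NULL,
            chrom        TEXT NOT NULL,
            pos          INTEGER NOT NULL,
            ref          TEXT NOT NULL,
            alt          TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS variant_locus ON variant (chrom, pos);
    """)

def load_vcf_lines(conn, pseudonym_id, vcf_lines):
    """Insert the records of one single-sample VCF; return rows inserted."""
    rows = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        # The first five VCF columns are CHROM, POS, ID, REF, ALT.
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites: one row per allele
            rows.append((pseudonym_id, chrom, int(pos), ref, allele))
    conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

def carriers_in_region(conn, chrom, start, end):
    """Return (pseudonym_id, pos, ref, alt) with start <= pos <= end (1-based)."""
    return conn.execute(
        "SELECT pseudonym_id, pos, ref, alt FROM variant "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? "
        "ORDER BY pos, pseudonym_id",
        (chrom, start, end),
    ).fetchall()

# Made-up records for two hypothetical individuals.
sample_a = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "17\t150\t.\tA\tG\t50\tPASS\t.",
    "17\t900\t.\tC\tT,G\t50\tPASS\t.",
]
sample_b = ["17\t150\t.\tA\tG\t50\tPASS\t."]

conn = sqlite3.connect(":memory:")
create_variant_table(conn)
loaded_a = load_vcf_lines(conn, "FO-0001", sample_a)
loaded_b = load_vcf_lines(conn, "FO-0002", sample_b)
print(carriers_in_region(conn, "17", 100, 200))
```

With a layout like this, only the small variant table needs to sit on fast storage, while the BAM files stay on the slow tier until someone asks for them.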

written 21 months ago by Asaf5.7k

Good point, accessing sequence formats such as FASTQ and BAM will be rare, but not non-existent.

written 21 months ago by olavur80

PathOS has some of the functionality you are looking for. You can search for patients (maybe their metadata?), VCFs are displayed, IGV is incorporated for aligned read display. Their paper here.

Also, Molgenis and its NGS modules might be of use.

modified 21 months ago • written 21 months ago by Robert Sicko570

Thanks, Molgenis is exactly the kind of thing I need. It seems to have very advanced data management features, and is also geared towards biobanks.

written 21 months ago by olavur80

However, it seems very complex, and since it is an open-source and most likely government-funded project, I'm not sure I can expect much in terms of stability and long-term support.

written 21 months ago by olavur80

"I'm not sure I can expect much in terms of stability and long-term support."

That is a given for pretty much all software. That is one of the reasons one is expected to pay for the value-add that a supporting entity guarantees, even though the software itself may be free.

written 21 months ago by genomax68k