Question: Big Data: Storage And Analysis
9.2 years ago, ruphos (Los Angeles, CA) wrote:

A question for those of you out there working on larger bioinformatics analysis:

What sort of platform are you using to store your data and how are you integrating that into your analysis?

My work is part of a larger network of researchers pooling their data. As a result, people in different parts of the country need to be able to join their datasets with others' to perform combined analyses. Is anyone else in a similar situation? How are you addressing those needs?

For reference, we have genotype data on ~2000 individuals. A mix of 550k and OMNI 1M chips, depending on the run. We have numerous datasets of various phenotypes relating to our area of study for most of those individuals to do trait analysis. We've mostly been doing GWAS with PLINK, but will be doing more with IMPUTE, PennCNV, STRUCTURE and similar applications in the near future. We will also soon be handling whole exome data for several hundred subjects.
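To make the "join their datasets with others'" requirement concrete, here is a minimal Python sketch of one possible approach, not something the poster describes using. It assumes each site exports phenotypes as a tab-delimited table with `FID` and `IID` columns that match the first two columns of a PLINK `.fam` file. The function names, the toy data, and the phenotype column names (`bmi`, `ldl`) are all hypothetical.

```python
import csv
import io


def read_fam_ids(fam_text):
    """Return the set of (family ID, individual ID) pairs from PLINK .fam text."""
    ids = set()
    for line in fam_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            ids.add((fields[0], fields[1]))
    return ids


def read_phenotypes(tsv_text):
    """Map (FID, IID) -> phenotype columns from a tab-delimited table."""
    table = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        key = (row.pop("FID"), row.pop("IID"))
        table[key] = row
    return table


def join_on_individual(genotyped, *phenotype_tables):
    """Inner join: keep genotyped individuals present in every phenotype table."""
    joined = {}
    for key in sorted(genotyped):
        if all(key in table for table in phenotype_tables):
            merged = {}
            for table in phenotype_tables:
                merged.update(table[key])
            joined[key] = merged
    return joined


# Hypothetical toy inputs standing in for real files from two sites.
FAM = "F1 S1 0 0 1 -9\nF2 S2 0 0 2 -9\nF3 S3 0 0 1 -9\n"
SITE_A = "FID\tIID\tbmi\nF1\tS1\t24.1\nF2\tS2\t30.5\n"
SITE_B = "FID\tIID\tldl\nF2\tS2\t131\nF3\tS3\t98\n"

if __name__ == "__main__":
    result = join_on_individual(
        read_fam_ids(FAM), read_phenotypes(SITE_A), read_phenotypes(SITE_B)
    )
    for individual, phenotypes in result.items():
        print(individual, phenotypes)
```

An inner join is only one choice. A combined analysis that tolerates missing phenotypes would keep individuals found in any table and fill the gaps with a missing-value code.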


I know your pain :-)

Reply written 9.2 years ago by Pierre Lindenbaum

You might find this question useful, since it appears you can use Hadoop and HDF5 together: "Using HDF5 to store bio-data".

Reply written 9.2 years ago by Blunders (modified 4 months ago by RamRS)
9.2 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

My two cents:

You should have a look at Deepak Singh's slides about the world of Big Data, Amazon, Hadoop, etc.

Galaxy can be installed on your server(s) and it allows your users to merge/join/etc the NGS data.

On my side, we are currently working on some exome data and I usually handle those data with BerkeleyDB-JE (instead of a classical RDBMS).
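As a rough illustration of what storing variant data in a key-value store rather than a classical RDBMS can look like, here is a minimal Python sketch. It does not use BerkeleyDB-JE and is not the layout described in this answer. It assumes a store that behaves like a mapping from bytes keys to bytes values (a plain `dict` here; a `dbm` database would expose a similar interface). The key format and all function names are hypothetical.

```python
import json


def variant_key(chrom, pos, ref, alt):
    """Build a key whose byte order follows chromosome name, then position."""
    # Zero-padding the position makes lexicographic order match numeric order.
    return f"{chrom}:{pos:010d}:{ref}:{alt}".encode("ascii")


def put_genotype(store, chrom, pos, ref, alt, sample, genotype):
    """Add or update one sample's genotype in the record for a variant."""
    key = variant_key(chrom, pos, ref, alt)
    record = json.loads(store[key]) if key in store else {}
    record[sample] = genotype
    store[key] = json.dumps(record, sort_keys=True).encode("ascii")


def get_genotypes(store, chrom, pos, ref, alt):
    """Return {sample: genotype} for a variant, or an empty dict if absent."""
    key = variant_key(chrom, pos, ref, alt)
    return json.loads(store[key]) if key in store else {}


def variants_in_region(store, chrom, start, end):
    """Return keys of variants with start <= pos <= end, in position order.

    This scans every key; an ordered on-disk store could instead seek
    straight to the first key in the region and stop at the last.
    """
    hits = []
    for key in sorted(store.keys()):
        k_chrom, k_pos, _ref, _alt = key.decode("ascii").split(":")
        if k_chrom == chrom and start <= int(k_pos) <= end:
            hits.append(key)
    return hits


if __name__ == "__main__":
    store = {}  # stand-in for a persistent key-value database
    put_genotype(store, "chr1", 12345, "A", "G", "sample1", "0/1")
    put_genotype(store, "chr1", 12345, "A", "G", "sample2", "1/1")
    print(get_genotypes(store, "chr1", 12345, "A", "G"))
```

Keying on the variant keeps all samples for one site in a single record, which suits "who carries this variant?" lookups. Keying on the sample instead would favor per-individual exports.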

As some physicians want to have a closer look at the data, I've created a Java Web Start application to allow them to view the data (just the VCFs, a few MB) via a graphical interface.


Galaxy looks neat, I'll definitely have to check into that one more. We were looking at key-value based systems, but decided it would take more development than we had resources for. I'm thinking it might still be something to consider as we get into exome data. Thanks!

Reply written 9.2 years ago by ruphos
9.2 years ago, apfejes wrote:

This sounds exactly like one of my projects. I've developed a database for combining large datasets of SNVs and indels, with a Java (command line) API for querying against the dataset. It's meant to be used as part of a larger pipeline, but is often used as a standalone tool. I started it because I was unable to find any other tool that could efficiently compare any number of datasets in a reasonable amount of time.

The trick is really in the implementation, though: looking at 2000 individuals at once and making sense of the trends isn't as easy as it sounds, but it can be done. (We're up to almost 1500 genome, exome and transcriptome libraries, so not quite the 2000 you've got, but at several million SNPs per library, it adds up fast.)

Anyhow, my database template and the API are open source, and we're just getting ready to submit an application note. I can give you more information if you want, but I'd hate to be spamming my own work here.
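For readers wondering what comparing any number of variant datasets involves at its simplest, here is a toy in-memory Python sketch. It is not the database or API described in this answer, and it would not scale to millions of SNPs per library. It assumes each library is a set of `(chrom, pos, ref, alt)` tuples; the function names and thresholds are hypothetical.

```python
from collections import Counter


def carrier_counts(libraries):
    """Count, for each variant, how many libraries contain it."""
    counts = Counter()
    for variants in libraries.values():
        counts.update(set(variants))  # each library counts at most once
    return counts


def enriched_variants(cases, controls, min_case_fraction, max_control_fraction):
    """Return variants frequent among case libraries and rare among controls."""
    case_counts = carrier_counts(cases)
    control_counts = carrier_counts(controls)
    hits = []
    for variant, n_cases in case_counts.items():
        case_fraction = n_cases / len(cases)
        if controls:
            control_fraction = control_counts[variant] / len(controls)
        else:
            control_fraction = 0.0
        if case_fraction >= min_case_fraction and control_fraction <= max_control_fraction:
            hits.append(variant)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical toy libraries.
    cases = {
        "libA": {("chr1", 100, "A", "G"), ("chr2", 50, "C", "T")},
        "libB": {("chr1", 100, "A", "G")},
    }
    controls = {"libC": {("chr2", 50, "C", "T")}, "libD": set()}
    print(enriched_variants(cases, controls, 1.0, 0.0))
```

The logic is a few lines; the difficulty is doing the same counting over thousands of libraries without holding everything in memory, which is why a database with an index on the variant position is the usual next step.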


Don't tease us -- definitely give a URL to your work if it's open source and you feel comfortable sharing it. There's nothing wrong with promoting good work you've done, especially when it answers the question.

Reply written 9.2 years ago by Brad Chapman

I'm looking into all sorts of options, just to get an idea of what's out there and what other people are doing if nothing else. I'd be more than happy to take a look at your project.

Reply written 9.2 years ago by ruphos

As Brad puts it, if it is already open source, why not give out the link to it?

Reply written 9.2 years ago by Istvan Albert

I obviously don't visit enough, since I didn't know there were comments. My work is part of the "Vancouver Short Read Analysis Package" on SourceForge.

Sorry for taking so long to reply.

Reply written 9.2 years ago by apfejes
8.7 years ago, 1888 (United States) wrote:

Hi all genetics/genomics researchers,

Just to continue on this thread: are there any databases available to pool all the exome data that the different groups are generating? I know it is not easy to make this public, because sequencing is still so expensive, but I think it is time to do that to understand more about diseases. Don't you think?


Please open a new question.

Reply written 8.7 years ago by Pierre Lindenbaum