Question: Bigger Sample Sizes For Wgs Exome. Is Nosql The Way To Go? Or Bio Hdf
1
Kevin630 wrote, 7.8 years ago:

Just thinking out loud: given that sequencing datasets are getting bigger, would it be more convenient in the long run to use NoSQL to store FASTQ reads instead of using BioHDF?

I understand that Big Data analytics and its various optimizations have been around for a while. Do we need a different way of dealing with NGS data? Why not tap into existing infrastructure?

While we may not require real-time analytics, with more sequencing being done I wonder whether some existing analyses are done the way they are simply to avoid thinking about, or working with, all of the big FASTQ and BAM files.

While I think BioHDF is a fantastic way to wrap metadata into a binary for storage, unless you have a multi-CPU server it is going to be difficult to parallelize, so it feels like a stop-gap measure.

Implementing this in NoSQL might also simplify access to the data, since working with a NoSQL store is more likely to be an existing core skill.

fastq next-gen sequencing • 3.4k views
written 7.8 years ago by Kevin630
2

I believe anything that lets us abandon the terrible FASTQ 'format' and allows a more efficient and structured representation will be beneficial.

written 7.8 years ago by Dr. Mabuse47k

Amen to that for VCF. But for FASTQ, what kind of structured information are you thinking about, beyond { name, sequence, qual }?

written 7.8 years ago by Pierre Lindenbaum124k

I wouldn't add any further information. I primarily think about something that really is a format in the first place, which requires a stringent format definition. Its syntax should be formalized in a way that allows parsing, e.g. with EBNF, an XML schema, or similar. (Actually, the first thing I would do is prescribe that each FASTQ record consists of exactly four lines.) To increase efficiency, a binary format (which can, for example, be defined in HDF5, making it platform-independent) or one designed from scratch, like BAM, would do much better, and such a binary format is well defined by design.
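For illustration, a strict reader enforcing the "exactly four lines per record" rule suggested above might look like this. This is a minimal sketch, not a formal spec; the function and type names are illustrative, and the validation rules (leading `@`, `+` separator, matching lengths) are the standard FASTQ conventions.

```python
# Minimal strict FASTQ reader: every record must be exactly 4 lines.
# Names (FastqRecord, read_fastq) are illustrative, not from a library.
from typing import Iterator, NamedTuple

class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str

def read_fastq(lines: Iterator[str]) -> Iterator[FastqRecord]:
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        try:
            seq = next(it).rstrip("\n")
            plus = next(it).rstrip("\n")
            qual = next(it).rstrip("\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@"):
            raise ValueError("header must start with '@': " + header)
        if not plus.startswith("+"):
            raise ValueError("separator line must start with '+'")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, qual)
```

A strict definition like this rejects the multi-line wrapped records that make FASTQ so painful to parse reliably.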

written 7.8 years ago by Dr. Mabuse47k

Perhaps I should have described the work environment, http://www.hpcinthecloud.com/hpccloud/2012-02-29/cloud_computing_helps_fight_pediatric_cancer.html, before asking whether NoSQL might excel over HDF. Quoting from that URL: "Before they'd have to ship hard drives to each other to have that degree of collaboration and now the data is always accessible through the cloud platform.

"We expect to change the way that the clinical medicine is delivered to pediatric cancer patients, and none of this could be done without the cloud," Coffin says emphatically. "With 12 cancer centers collabor

modified 8 weeks ago by RamRS25k • written 7.8 years ago by Kevin630
8
lh331k (United States) wrote, 7.8 years ago:

Re FASTQ: I quite like FASTQ. It is very concise and efficient to parse. The biggest problem with FASTQ is that we are unable to store meta information, so a proper format may be worthwhile. A few centers have replaced FASTQ with BAM, but even so, I think FASTQ will live on for a long time.

Re SAM/BAM and BioHDF: There is an effort to keep SAM data in BioHDF. I know the conversion tools and the indexer were working two years ago, but it is still in alpha as of now. I tried the latest BioHDF briefly: it produced larger files and invoked far more read/lseek system calls. This may be worrying if we simultaneously access alignments from 1000 individuals.

Re VCF: Unlike FASTQ, and arguably SAM, VCF keeps structured data. I can see the point of improving VCF, but I am not convinced that we can do much better with NoSQL/HDF unless someone shows me the right way.

Re specialized formats vs. HDF/NoSQL/SQL: A generic database engine can hardly beat a specialized binary format. When file size or access speed is really critical, specialized formats such as SRA/BAM almost always win by a large margin. On the other hand, coming up with an efficient and flexible binary format is non-trivial and takes a long time. If data are not frequently accessed by all end users (e.g. trace/intensity data), a format built upon HDF/NoSQL is faster to develop and more convenient to access, and thus better.

Re HDF vs. NoSQL: HDF (not BioHDF) has wider adoption in biology; PacBio and Nanopore have both adopted HDF to some degree. Personally, I like HDF's hierarchical model better. Berkeley DB is too simple, and most recent NoSQL engines are too young; it remains to be seen how they evolve.

My general view is that in NGS, HDF may be a good format for organizing internal data. I think PacBio and Nanopore are taking the right path (I wish Illumina had done the same at the beginning). However, it is not worth exploring NoSQL solutions for existing end-user data. These solutions are very likely to make the data even bigger, slower to process and harder to access, especially for biologists. I am not sure how much "bigger" your exome data are: the 1000 Genomes Project has 50TB of alignments in BAM, and that system works so far. I do not think a generic NoSQL engine can work as well in this particular application.

modified 7.8 years ago • written 7.8 years ago by lh331k

Thanks for the very comprehensive answer! To me, the reason it might be difficult to work with 1000G BAMs is precisely the size of the data: it's non-trivial to move 50TB anywhere in the world.

I suggested NoSQL not because I felt it's better than HDF (or BAM) in local centers, but because I think collaborative research and shared or open data, as in 1000G, would benefit from a public NoSQL database of the raw reads, with centers around the world accessing the data simultaneously. Although I agree that data transfer speed will still be an issue.

written 7.8 years ago by Kevin630

As you said, the speed and throughput of the network are not good enough for a centralized database. And when we do have the speed and throughput some day, a specialized solution will still win by a large margin.

written 7.8 years ago by lh331k
3
Deniz140 (Cambridge) wrote, 7.8 years ago:

A FASTQ file in a NoSQL database takes more space than a flat fastq.gz file. "Big Data analytics" is just a buzzword; it doesn't mean anything. NGS produces far more data than most so-called "big data" applications.

How does NoSQL simplify access to the data? Now one needs a whole database travelling around whenever people try to send each other FASTQ files.

written 7.8 years ago by Deniz140
1

Compression is magical. A compressed SAM file is usually smaller than the equivalent BAM, even though the uncompressed SAM is much larger than the uncompressed BAM. When the data are compressible, simply changing the encoding without an advanced compression algorithm (LZ/Huffman/...) is less effective.
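The point above can be illustrated with a toy experiment: on redundant tab-delimited text, a general-purpose compressor (here gzip's DEFLATE, the same family used by BAM) removes far more redundancy than a simple change of encoding would. The "SAM-like" line below is a crude stand-in, not real alignment data, and the sizes are illustrative rather than a benchmark.

```python
# Toy demo: gzip on repetitive SAM-like text shrinks it dramatically,
# far beyond what a fixed re-encoding alone could achieve.
import gzip

# Crude stand-in for highly redundant alignment text (1000 identical lines).
sam_like = ("read\t0\tchr1\t100\t60\t100M\t*\t0\t0\t" + "ACGT" * 25 + "\n") * 1000
raw = sam_like.encode("ascii")
compressed = gzip.compress(raw)
```

Real alignment data is less repetitive than this, but the qualitative conclusion (entropy coding dominates naive binary packing on compressible data) holds.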

written 7.8 years ago by lh331k

It makes sense if you need indexes over the sequences/names/positions.

written 7.8 years ago by Pierre Lindenbaum124k

Your first statement is vacuous, because no specific NoSQL system is mentioned. Do you really think that storing data which can be encoded in 2 bits (nucleotide) + 4-8 bits (short-integer quality score) per base call in a text file is efficient? Compression alone is not the answer, because the efficient encoding can be compressed as well.
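The 2-bit encoding mentioned above can be sketched as follows: each base maps to 2 bits, so four bases fit in one byte (quality scores would be stored separately). The helper names are illustrative, not from any library, and this naive version handles only ACGT (no N or ambiguity codes).

```python
# Sketch of 2-bit nucleotide packing: 4 bases per byte, ACGT only.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"

def pack_2bit(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for j, base in enumerate(seq[i:i + 4]):
            byte |= _CODE[base] << (2 * j)  # low bits hold earlier bases
        out.append(byte)
    return bytes(out)

def unpack_2bit(data: bytes, length: int) -> str:
    return "".join(
        _BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )
```

This gives a 4x reduction over ASCII before any compression, and the packed bytes can then be compressed further, which is exactly the "compress the efficient encoding" point.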

written 7.8 years ago by Dr. Mabuse47k

Compression should come on top, applied to the efficient encoding. In our experience, storing data in HDF5 with correct data types and with HDF5's built-in compression turned on can drastically reduce the size of the resulting files, even compared to compressed text files (around a factor of 10 in our example).

written 7.8 years ago by Dr. Mabuse47k

Out of curiosity, what is the format of your text file? A factor of 10 rarely happens unless the text file has very unusual structure or redundancy.

written 7.8 years ago by lh331k

Those were HapMap LD files, e.g. like this one (http://hapmap.ncbi.nlm.nih.gov/downloads/ld_data/latest/ld_chr1_CEU.txt.gz). They consist of positional information (integers), some numeric float values (r, r2), and two rs IDs (e.g. "rs1234567"). We stored the rs IDs as integer values too. The factor of 10 was an estimate and may be an overestimate. Storing all LD data for the populations (CEU, CHB, JPT, YRI) across all chromosomes this way results in a 4.4GB HDF5 file.
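The "correct data types" idea described above can be sketched without HDF5 using fixed-width binary records: positions and rs numbers as 32-bit integers, r/r2 as 32-bit floats, so each row costs 24 bytes regardless of how many digits it would take as text. The field layout here is an illustrative assumption, not the HapMap file's actual schema.

```python
# Fixed-width binary rows for LD-like data: (pos1, pos2, r, r2, rs1, rs2).
# 24 bytes per row; the layout is illustrative, not HapMap's schema.
import struct

ROW = struct.Struct("<iiffii")  # little-endian: int, int, float, float, int, int

def encode_rows(rows):
    return b"".join(ROW.pack(*row) for row in rows)

def decode_rows(blob):
    return [ROW.unpack_from(blob, off) for off in range(0, len(blob), ROW.size)]
```

Fixed-width records also allow O(1) seeking to the i-th row, which variable-length text lines cannot offer; HDF5 adds chunking and built-in compression on top of the same typed layout.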

written 7.8 years ago by Dr. Mabuse47k

By the way, that file is here: http://www.bccs.uni.no/~mdo041/Downloads/hapmapLD.h5

written 7.8 years ago by Dr. Mabuse47k

MD: I had not seen your reply until now. Thanks for that!

written 7.8 years ago by lh331k
2
Pierre Lindenbaum124k (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 7.8 years ago:

I tried to use HDF5 and found the C API too difficult for my needs. Although I don't need to store this kind of data, I've played with BerkeleyDB to store such sequences. Something like: http://plindenbaum.blogspot.com/2009/03/string-challenge-my-brute-force.html

modified 10 weeks ago by RamRS25k • written 7.8 years ago by Pierre Lindenbaum124k
1

We have found the Java API to HDF5 to be higher-level and much easier to use than what you describe in your blog. We have used it to generate HDF5 files that contain all HapMap SNP pairs with LD information. These are subsequently read into R using R's hdf5 package.

written 7.8 years ago by Dr. Mabuse47k
1

At what point does an RDBMS become too slow? 50,000 rows?

written 7.8 years ago by Jeremy Leipzig18k
1

Jeremy, 50,000 seems a bit low. MySQL, for example, can handle millions of rows, if not billions. It's not really the sheer number of entries but join operations that drive the cost. On the other hand, if you do not have to model relations (foreign keys) in your data, you don't need a relational DBMS. You could still use one, but it's not required, and a simpler system like a key-value store might be more efficient (though it doesn't have to be) because of reduced overhead.

written 7.8 years ago by Dr. Mabuse47k

It depends on your implementation/hardware/cluster/memcached, etc. The UCSC Genome Browser uses MySQL efficiently. But sometimes a NoSQL solution is more elegant (e.g. storing structured data as an array of bytes with BDB).
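The "structured data as an array of bytes in a key-value store" pattern mentioned above can be sketched with Python's stdlib `dbm` module standing in for BerkeleyDB. The record layout (a position and a score packed with `struct`) is an illustrative assumption, not anyone's actual schema.

```python
# Key-value pattern: a structured record serialized to bytes, stored
# under a key. Python's stdlib dbm stands in for BerkeleyDB here.
import dbm
import os
import struct
import tempfile

REC = struct.Struct("<if")  # illustrative record: (position, score)

path = os.path.join(tempfile.mkdtemp(), "demo.db")
with dbm.open(path, "c") as db:
    db[b"rs1234567"] = REC.pack(100, 0.5)

with dbm.open(path, "r") as db:
    pos, score = REC.unpack(db[b"rs1234567"])
```

The store never interprets the value; all structure lives in the application's packing code, which is exactly why there is no join cost and no schema overhead.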

written 7.8 years ago by Pierre Lindenbaum124k

I heard IGV is trying to move away from HDF5 because it does not have native Java APIs. For most queries (including joins), MySQL/Oracle is efficient enough for a web server given a million rows. Whether it is fast enough for a particular application depends on the pattern of queries.

written 7.8 years ago by lh331k

0
SES8.2k (Vancouver, BC) wrote, 7.8 years ago:

Is there a reason SQLite has not been considered as a possible solution? From what I have read, it can handle large data just fine, and its footprint is much smaller than that of these other solutions.
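For what it's worth, the SQLite idea is trivial to prototype, since it is embedded and zero-configuration. A minimal sketch of indexing reads by name follows; the table and column names are illustrative assumptions, not any project's schema.

```python
# Minimal SQLite sketch: an embedded table of reads, keyed by name.
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would persist the index
conn.execute(
    "CREATE TABLE reads (name TEXT PRIMARY KEY, sequence TEXT, quality TEXT)"
)
conn.executemany(
    "INSERT INTO reads VALUES (?, ?, ?)",
    [("read1", "ACGT", "IIII"), ("read2", "TTAA", "FFFF")],
)
conn.commit()

# Point lookup by read name uses the primary-key index.
row = conn.execute(
    "SELECT sequence FROM reads WHERE name = ?", ("read2",)
).fetchone()
```

Whether this scales to billions of reads is a separate question; for TEXT-typed sequence data the file will also be much larger than fastq.gz, echoing the size caveats elsewhere in this thread.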

Also, I'm not convinced BerkeleyDB will be a good solution for NGS data. Some BioPerl modules use DB_File (Bio::Index::Fastq, for example), and its limitations for NGS data have been discussed numerous times. I can't say whether the performance issue lies in DB_File, the module itself, or BerkeleyDB, but that method is extremely slow even for reading through a large FASTQ file.

written 7.8 years ago by SES8.2k
Powered by Biostar version 2.3.0