Re FASTQ: I quite like fastq. It is very concise and efficient to parse. The biggest problem with fastq is that we are unable to store meta information. A proper format may be worthwhile. A few centers have replaced fastq with BAM. But even so, I think fastq will live long.
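To illustrate how little it takes to parse fastq, here is a minimal sketch in Python. It assumes the common four-lines-per-record layout (sequence and quality not wrapped across lines); the function name `read_fastq` and the sample reads are made up for illustration. Note that the only place for meta information is the free-form text after the read name, which is the limitation mentioned above.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ stream.

    Assumption: each record is exactly four lines (no line wrapping).
    """
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header.strip())
        # Everything after '@' is the name plus an unstructured comment.
        yield header[1:].rstrip("\n"), seq, qual


# Hypothetical two-read example held in memory.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2 extra\nGGC\n+\n!!#\n")
records = list(read_fastq(example))
```

Real files with wrapped sequence lines would need a more careful reader than this.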
Re SAM/BAM and BioHDF: There is an effort to store SAM in BioHDF. I know the conversion tools and the indexer were working two years ago, but it is still in alpha as of now. I tried the latest BioHDF briefly. It produced larger files and invoked far more read/lseek system calls. This may be worrying if we simultaneously access the alignments of 1000 individuals.
Re VCF: Unlike FASTQ and arguably SAM, VCF keeps structured data. I can see the point of improving VCF, but I am not convinced that we can do much better with NoSQL/HDF unless someone shows me the right way.
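As a rough illustration of what "structured" means here, the sketch below splits one VCF data line into its eight fixed columns and turns the INFO column into key/value pairs. This is not a full VCF parser: the function name `parse_vcf_line` is made up, INFO values are kept as strings (their real types are declared in the `##INFO` header lines, which this sketch ignores), and genotype columns are not handled.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a single VCF data line.

    Assumptions: tab-separated input, header lines already skipped,
    INFO values left as strings, flag-type INFO keys mapped to True.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True  # flags have no '='

    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt,
        "INFO": info_dict,
    }


# Illustrative data line in the style of the VCF specification's examples.
record = parse_vcf_line(
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\n"
)
```

The fixed columns plus keyed INFO entries are what make VCF queryable field by field, in contrast to the free-form header line of fastq.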
Re specialized format vs. HDF/NoSQL/SQL: A generic database engine can hardly beat a specialized binary format. When file size or access speed is really critical, specialized formats such as SRA/BAM almost always win by a large margin. On the other hand, coming up with an efficient and flexible binary format is non-trivial and takes a long time. If data are not frequently accessed by all end users (e.g. trace/intensity), a format built upon HDF/NoSQL is faster to develop and more convenient to access, and thus better.
Re HDF vs. NoSQL: HDF (not BioHDF) has wider adoption in biology. PacBio and Oxford Nanopore have both adopted HDF to some extent. Personally, I like HDF's hierarchical model better. Berkeley DB is too simple. Most recent NoSQL engines are too young. It remains to be seen how they evolve.
My general view is that in NGS, HDF may be a good format for organizing internal data. I think PacBio and Oxford Nanopore are taking the right path (I wish Illumina had done the same at the beginning). However, it is not worth exploring NoSQL solutions for the existing end-user data. These solutions are very likely to make data even bigger in size, slower to process and harder to access, especially for biologists. I am not sure how much "bigger" your exome data are. 1000g has 50TB of alignments in BAM. The system so far works. I do not think a generic NoSQL engine can work as well in this particular application.
modified 8.6 years ago
8.6 years ago by
lh3 ♦ 32k