VCF is OK as a suitcase for small-scale variation and, to a lesser extent, annotation. But you can't live out of your suitcase forever.
VCF isn't a database, and will never support region and sample queries at scale or at "web-speed" in the era of national biobanks. Even its usefulness for transmitting variants does not hold up past a few thousand samples. Annotation can also be problematic, given that everything needs to be serialized into the INFO field. The shift away from joint genotyping and toward single-sample gVCFs as the preferred currency further muddies the waters.
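To make the annotation point concrete, here is a minimal sketch of what unpacking an INFO field involves. The record, the sub-field order of the pipe-delimited `CSQ`-style value, and the helper name `parse_info` are illustrative assumptions, not taken from any particular file or library.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict; flags (keys with no '=') map to True."""
    fields = {}
    if info == ".":  # '.' means the INFO column is empty
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Values stay as lists of strings; typing them needs the header.
        fields[key] = value.split(",") if sep else True
    return fields


# Hypothetical record: depth, two alt-allele frequencies, a flag, and a
# pipe-delimited annotation whose sub-field order is assumed here.
info = "DP=100;AF=0.5,0.25;DB;CSQ=T|missense_variant|BRCA2"
parsed = parse_info(info)

# The nested annotation needs a second, format-specific parsing pass.
csq_keys = ["allele", "consequence", "gene"]  # assumed order, normally read from the header
annotations = [dict(zip(csq_keys, entry.split("|"))) for entry in parsed["CSQ"]]
```

Every consumer has to repeat both passes, and nothing in the string itself says how the inner fields are typed or ordered.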
Four major types of successor to VCF as a variant warehouse are worth mentioning.
Spark-based (requires a Spark cluster to scale):
- hail.MatrixTable - based on Parquet. Hail powers a number of analyses on gnomAD, UK Biobank, and other large genomic datasets.
- Glow Spark DataFrame - based on Spark and Delta Lake. Glow offers GloWGR, a distributed version of the regenie GWAS package, and provides user-defined functions (UDFs) and variant normalization functions.
Cloud-vendor managed solutions
Distributed SQL & NoSQL
- OpenCGA - open-source project for storing variant data and associated metadata in MongoDB
- Snowflake - closed-source distributed SQL engine
Multidimensional array based
- SciDB - closed-source platform. Hosts large datasets including UK Biobank.
- TileDB-VCF (requires a TileDB-Cloud account to scale) - an open-source Python package that uses serverless TileDB multidimensional arrays indexed on chr, pos, and sample. TileDB-VCF on TileDB-Cloud powers real-time queries for variant browsers as well as large notebook-based analyses that use task graphs in conjunction with UDFs. Disclaimer: I am the product manager for TileDB-VCF.
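The idea of slicing on chr, pos, and sample can be illustrated with a toy in-memory index. This is a sketch of the access pattern only: it does not use or resemble the TileDB-VCF or SciDB APIs, and the class name, method names, and data are hypothetical.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class ToyVariantStore:
    """Toy store keyed on (chrom, pos, sample); for illustration only."""

    def __init__(self):
        # chrom -> list of (pos, sample, genotype) kept sorted by pos, then sample
        self._cells = defaultdict(list)

    def add(self, chrom, pos, sample, genotype):
        insort(self._cells[chrom], (pos, sample, genotype))

    def query(self, chrom, start, end, samples=None):
        """Return cells with start <= pos <= end, optionally restricted to samples."""
        cells = self._cells.get(chrom, [])
        lo = bisect_left(cells, (start,))    # first cell with pos >= start
        hi = bisect_left(cells, (end + 1,))  # first cell with pos > end
        return [c for c in cells[lo:hi] if samples is None or c[1] in samples]


store = ToyVariantStore()
store.add("chr1", 100, "S1", "0/1")
store.add("chr1", 150, "S2", "1/1")
store.add("chr1", 300, "S1", "0/1")
store.add("chr2", 100, "S1", "1/1")

hits = store.query("chr1", 90, 200, samples={"S1"})
```

The region lookup is a binary search rather than a scan, and the sample filter is applied only to cells in range. A real array engine does this on disk and across all three dimensions, but the query shape is the same.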
These solutions have vastly different performance, flexibility, and portability characteristics, as well as different cost structures, infrastructure needs, and varying levels of support for gVCF ref/no-call ranges (the n+1 problem), SVs, and pangenomic graph-based representations. It seems likely the growing interest in multi-omics - combining analyses of genomic variation with transcriptomics, proteomics, cytomics, and imaging - will also shape the future of variant warehouses.
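One reason gVCF reference and no-call ranges need explicit support: a position query has to find the block that covers it rather than match an exact coordinate. A minimal sketch, assuming non-overlapping blocks sorted by start on a single chromosome; the block values and the helper name `genotype_at` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical blocks from one sample's gVCF on one chromosome:
# (start, end, genotype), 1-based inclusive, sorted and non-overlapping.
blocks = [
    (1, 999, "0/0"),
    (1000, 1000, "0/1"),  # a variant site
    (1001, 1500, "0/0"),
    (1501, 1600, "./."),  # a no-call range
]
starts = [b[0] for b in blocks]


def genotype_at(pos):
    """Return the genotype of the block covering pos, or None if uncovered."""
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i >= 0 and blocks[i][0] <= pos <= blocks[i][1]:
        return blocks[i][2]
    return None
```

A store that only indexes exact positions would miss position 1200 here, even though the sample is confidently homozygous reference there.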