While in general, I agree with Heng's points, we have found the use of Tabix and vcftools very useful for some analyses but a bit cumbersome for more intricate explorations of genetic variation. We find this to be especially true when one wants to put those variants in broader context through comparisons to many different genome annotations.
Our solution, while not yet final, is a new tool that we have been developing called gemini. Our hope for gemini is for it to be used as a standard framework for exploring genetic variation on both family-based studies of disease and for broader population genetic studies. The basic gist is that you load a VCF file into a Gemini database (SQLite is the backend for portability and flexibility). As each variant is read from the VCF, it is annotated via comparisons to many different genome annotation files (e.g., ENCODE, dbSNP, ESP, UCSC, ClinVar, KEGG). Tabix is used to expedite the comparisons. The variants and the associated annotations are stored in a variants table. The attractive aspect to us is that the database framework itself is the API. Once the data is loaded, one can ask quite complex questions of one's data. With the help of Brad Chapman and Rory Kirchner, we have parallelized the loading step, which is, by far, the slowest aspect. To use multiple cores, one would do:
gemini load -v my.vcf --cores 20 my.db
One can also farm the work out to LSF, SGE, and Torque clusters. Once loaded, one can query the database via the command line:
gemini query -q "select chrom, start, end, ref, alt, \
aaf, hwe, in_dbsnp, is_lof, impact, num_het \
from variants" \
Also, we represent sample genotype information in compressed numpy arrays that are stored as binary BLOB columns to minimize storage and allow scalability (i.e., as opposed to having a genotypes _table_ where the number of rows is N variants * M samples: bad). As such, we have extended the SQL framework to allow struct-like access to the genotype info. For example,
the following query finds rare, LoF variants meeting an autosomal recessive inheritance model (note that the
--gt-filter option uses Python, not SQL syntax).
gemini query -q “select chrom, start, end,
ref, alt, gene,
impact, aaf, gts.kid
where in_dbsnp = 0
and aaf < 0.01
and in_omim = 1
and is_lof = 1”
“gt_types.mom = HET
gt_types.dad = HET
gt_types.kid = HOM_ALT”
There is also a Python interface allowing one to write custom Python scripts that interface with the Gemini DB. An example script can be found at: http://gist.github.com/arq5x/5236411.
There are many built-in tools for things such as finding compound hets, de novo mutations, protein-protein interactions, pathway analysis, etc.
Gemini scales rather well. We recently loaded entire 1000 Genomes VCF (1092 samples) in 26 hours using 30 cores. The queries are surprisingly fast. While this scale is not our current focus (we are more focused on medical genetics), it is encouraging to see it work well with 100s of individuals.
If interested, check out the documentation. Comments welcome.