We have developed a new tool called GEMINI (GEnome MINIng) to facilitate the exploration of genetic variation in the context of a wide range of genome annotations that are crucial to interpretation and prioritization. Unlike existing tools, GEMINI integrates genetic variation with a diverse and flexible set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration.
By loading both genetic variants in VCF format and genome annotations into a unified SQLite database, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. GEMINI is well-suited to exploring variation in personal genomes and family-based genetic studies, and it scales to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility, and our goal is to provide researchers with a standard framework for medical genomics.
The GEMINI project was conceived in the Quinlan lab, but it has also benefited from fantastic collaborations with Brad Chapman, Rory Kirchner, and Oliver Hofmann at the Harvard School of Public Health.
To get started with GEMINI, one needs a valid VCF file whose coordinates are based on Build 37 (hg19) of the human genome. We expect that you have annotated your VCF with either snpEff (instructions here) or VEP (instructions here). You then simply load the VCF into GEMINI with the load command. This populates a GEMINI database with the variants and automatically annotates them with all built-in annotations.
# assumes VCF has been annotated by snpEff
$ gemini load -v my.vcf -t snpEff my.gemini.db
One can also provide a PED file to define relationships among samples (useful for finding variants that meet expected inheritance patterns) and to define the sex and disease status of the samples.
$ gemini load -v my.vcf -t snpEff -p my.ped my.gemini.db
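As a rough illustration of the PED file passed with -p, the sketch below builds a minimal pedigree for a single trio. It assumes the conventional six-column, tab-separated PED layout (family ID, sample ID, father ID, mother ID, sex, phenotype), with sex coded 1 = male / 2 = female and phenotype coded 1 = unaffected / 2 = affected. The helper functions and the family and sample names are hypothetical, and the sample IDs would need to match those in your VCF.

```python
# Sketch: build a minimal six-column PED file for one trio.
# Helper names, family name and sample names are hypothetical placeholders.


def ped_line(family_id, sample_id, paternal_id="0", maternal_id="0", sex=0, phenotype=0):
    """Return one tab-separated PED record ("0" marks a founder or unknown value)."""
    fields = (family_id, sample_id, paternal_id, maternal_id, sex, phenotype)
    return "\t".join(str(f) for f in fields)


def trio_ped(family_id, dad, mom, kid, kid_sex, kid_affected=True):
    """Return PED text for two unaffected parents and one child."""
    lines = [
        ped_line(family_id, dad, sex=1, phenotype=1),
        ped_line(family_id, mom, sex=2, phenotype=1),
        ped_line(family_id, kid, dad, mom, sex=kid_sex,
                 phenotype=2 if kid_affected else 1),
    ]
    return "\n".join(lines) + "\n"


ped_text = trio_ped("fam1", dad="dad", mom="mom", kid="kid", kid_sex=2)

# Saving this text as my.ped would give a file usable with `gemini load -p my.ped`.
print(ped_text, end="")
```

Keeping the pedigree in a small helper like this makes it easy to regenerate the file when samples are renamed, but a hand-written PED file works just as well.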
Loading is computationally expensive, so the work can be distributed either across multiple CPUs on a single machine:
$ gemini load --cores 8 -v my.vcf -t snpEff my.gemini.db
or across a computing cluster managed by SGE, LSF, or Torque:
# LSF
$ gemini load --cores 128 --lsf-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db
# SGE
$ gemini load --cores 128 --sge-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db
# Torque
$ gemini load --cores 128 --torque-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db
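When loading is scripted (for example, in a pipeline that handles many VCF files), it can help to assemble the command in one place. The following is only a sketch: the build_load_command helper and its parameters are hypothetical, and it uses nothing beyond the flags shown in the examples above. It returns the argument list without running anything.

```python
# Sketch: assemble a `gemini load` command line from the options shown above.
# The helper name and its parameters are hypothetical; only the flags come
# from the examples in this document.

SCHEDULER_FLAGS = {
    "lsf": "--lsf-queue",
    "sge": "--sge-queue",
    "torque": "--torque-queue",
}


def build_load_command(vcf, db, annotator="snpEff", ped=None, cores=None,
                       scheduler=None, queue=None):
    """Return the argument list for `gemini load` without executing it."""
    cmd = ["gemini", "load"]
    if cores is not None:
        cmd += ["--cores", str(cores)]
    if scheduler is not None:
        if scheduler not in SCHEDULER_FLAGS:
            raise ValueError("unknown scheduler: %r" % scheduler)
        if queue is None:
            raise ValueError("a queue name is required with a scheduler")
        cmd += [SCHEDULER_FLAGS[scheduler], queue]
    cmd += ["-v", vcf, "-t", annotator]
    if ped is not None:
        cmd += ["-p", ped]
    cmd.append(db)
    return cmd


cmd = build_load_command("my.vcf", "my.gemini.db", cores=128,
                         scheduler="lsf", queue="my_bigbad_queue")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.check_call(cmd)
```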
Once loaded, one can begin exploring genetic variation using the "query" interface (see here for more details):
$ gemini query -q "select chrom, start, end, ref, alt from variants \
                   where is_lof = 1 \
                   and aaf >= 0.01" my.gemini.db
One can also select the genotypes of a specific sample (here, sample1) by adding a gts.sample1 column to the query:
$ gemini query -q "select chrom, start, end, ref, alt, gts.sample1 from variants \
                   where is_lof = 1 \
                   and aaf >= 0.01" my.gemini.db
One can also apply genotype filters with the --gt-filter option. This will return only those variants that meet the specific genotype criteria you enforce. Here is an example of a filter that enforces an autosomal recessive inheritance pattern. Note that these patterns follow Python syntax.
$ gemini query -q "select chrom, start, end, ref, alt, gts.mom, gts.dad, gts.kid from variants \
                   where is_lof = 1 and aaf >= 0.01" \
               --gt-filter "gt_types.dad == HET and gt_types.mom == HET and gt_types.kid == HOM_ALT" \
               my.gemini.db
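To make "Python syntax" concrete, the sketch below shows one way such a filter expression could be evaluated against each variant's genotypes. It is not GEMINI's implementation. The passes_filter helper is hypothetical, the integer codes for HOM_REF, HET, UNKNOWN and HOM_ALT are assumed for illustration, and the sketch assumes that per-sample genotype types are exposed to the filter under the name gt_types.

```python
# Sketch: evaluating a Python-syntax genotype filter once per variant.
# NOT GEMINI's implementation; the helper and the integer codes are assumptions.
from types import SimpleNamespace

HOM_REF, HET, UNKNOWN, HOM_ALT = 0, 1, 2, 3  # assumed genotype-type codes


def passes_filter(gt_filter, sample_types):
    """Evaluate `gt_filter` against a dict of sample name -> genotype type."""
    names = {
        # Lets the expression use attribute access such as gt_types.kid
        "gt_types": SimpleNamespace(**sample_types),
        "HOM_REF": HOM_REF,
        "HET": HET,
        "UNKNOWN": UNKNOWN,
        "HOM_ALT": HOM_ALT,
    }
    # eval() is acceptable here only because the filter is typed by the analyst;
    # builtins are removed so the expression can see nothing but `names`.
    return bool(eval(gt_filter, {"__builtins__": {}}, names))


recessive = "gt_types.dad == HET and gt_types.mom == HET and gt_types.kid == HOM_ALT"

# Toy genotype types for two variants in a trio (made-up data).
variants = [
    {"dad": HET, "mom": HET, "kid": HOM_ALT},
    {"dad": HET, "mom": HOM_REF, "kid": HET},
]

kept = [v for v in variants if passes_filter(recessive, v)]
print(kept)
```

Only the first toy variant survives, because it is the only one where both parents are heterozygous and the child is homozygous for the alternate allele.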
In addition, there are many built-in tools for conducting common analyses and finding variants that meet inheritance patterns that make sense for the phenotype you are studying. Please see here for more details.
Find de novo variants
$ gemini de_novo my.gemini.db
Find variants meeting an autosomal recessive inheritance pattern
$ gemini autosomal_recessive my.gemini.db
Find variants meeting an autosomal dominant inheritance pattern
$ gemini autosomal_dominant my.gemini.db
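For intuition about what these tools look for, here is a deliberately simplified sketch of the genotype logic behind two of the patterns in a single trio with unaffected parents and an affected child. It is not the logic GEMINI itself applies. The built-in tools work from the relationships and disease status defined in the PED file, and the trio_pattern function and string labels below are hypothetical.

```python
# Sketch: simplified trio rules for two inheritance patterns.
# The function name and the genotype labels are hypothetical placeholders.


def trio_pattern(dad, mom, kid):
    """Classify one variant from the genotype types of an affected child's trio."""
    # De novo candidate: the child carries an allele that neither parent has.
    if dad == "HOM_REF" and mom == "HOM_REF" and kid == "HET":
        return "de_novo"
    # Autosomal recessive candidate: both parents are carriers and the
    # child inherited the alternate allele from each of them.
    if dad == "HET" and mom == "HET" and kid == "HOM_ALT":
        return "autosomal_recessive"
    return None


for genotypes in [("HOM_REF", "HOM_REF", "HET"),
                  ("HET", "HET", "HOM_ALT"),
                  ("HET", "HOM_REF", "HET")]:
    print(genotypes, "->", trio_pattern(*genotypes))
```

Autosomal dominant inheritance is left out of the sketch because it depends on which parent is affected, which genotypes alone do not capture.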
Lastly, we see GEMINI as a framework for researchers to develop their own new tools and methods. We see the GEMINI database as the "API", and given that SQLite databases are portable, the code you develop based upon the Python API will work on any GEMINI database.
from gemini import GeminiQuery

# open the database and run a SQL query against the variants table
gq = GeminiQuery("my.db")
gq.run("select chrom, start, end from variants")

# iterate over the result rows
for row in gq:
    print(row)
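Because a GEMINI database is an ordinary SQLite file, it can in principle also be read with Python's standard sqlite3 module. The sketch below runs the loss-of-function query shown earlier against a small in-memory stand-in table. The table holds only a few of the columns used in this document and made-up rows, so it should not be taken as the real GEMINI schema. For an actual database, one would connect to the database file instead of ":memory:" and skip the table creation.

```python
# Sketch: querying a GEMINI-like SQLite table with the standard library.
# The table below is a toy stand-in with made-up rows, not the real schema.
import sqlite3

conn = sqlite3.connect(":memory:")

# "end" is quoted because END is also an SQL keyword.
conn.execute(
    'CREATE TABLE variants (chrom TEXT, start INTEGER, "end" INTEGER, '
    "ref TEXT, alt TEXT, is_lof INTEGER, aaf REAL)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("chr1", 100, 101, "A", "G", 1, 0.05),   # LoF and common enough
        ("chr1", 200, 201, "C", "T", 0, 0.20),   # not LoF
        ("chr2", 300, 301, "G", "A", 1, 0.001),  # LoF but too rare
    ],
)

query = (
    'SELECT chrom, start, "end", ref, alt FROM variants '
    "WHERE is_lof = 1 AND aaf >= 0.01"
)
rows = [tuple(row) for row in conn.execute(query)]
for row in rows:
    print(row)

conn.close()
```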
We are constantly adding features, but if there is something you would like to see added, please let us know (preferably using the mailing list).