Tool: Gemini: Integrative Exploration Of Genetic Variation And Genome Annotations
6.0 years ago by
Aaronquinlan10k
United States
Aaronquinlan10k wrote:

We have developed a new tool called GEMINI (GEnome MINIng) to facilitate the exploration of genetic variation in the context of a wide range of genome annotations that are crucial to interpretation and prioritization. Unlike existing tools, GEMINI integrates genetic variation with a diverse and flexible set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration.

By loading both genetic variants in VCF format and genome annotations into a unified SQLite database, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command-line and graphical tools for common analyses. GEMINI is well-suited to exploring variation in personal genomes and family-based genetic studies, and it scales to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility, and our goal is to provide researchers with a standard framework for medical genomics.

Source code || Documentation || Manuscript || Overview Presentation || Installation || Mailing list

The GEMINI project was conceived in the Quinlan lab, but it has also benefited from fantastic collaborations with Brad Chapman, Rory Kirchner, and Oliver Hofmann at the Harvard School of Public Health.

To get started with GEMINI, one needs a valid VCF file based on coordinates from Build 37 (hg19) of the human genome. We expect that you have annotated the VCF with either snpEff (instructions here) or VEP (instructions here). You then simply load the VCF into GEMINI with the load command. This populates a GEMINI database with the variants and automatically annotates them with all built-in annotations.

# assumes VCF has been annotated by snpEff
$ gemini load -v my.vcf -t snpEff my.gemini.db

One can also provide a PED file to define relationships among samples (useful for finding variants that meet expected inheritance patterns) and to define the sex and disease status of the samples.

$ gemini load -v my.vcf -t snpEff -p my.ped my.gemini.db
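For readers who have not written a PED file before, here is a minimal Python sketch that writes one for a single trio. It assumes the conventional six-column PED layout (family ID, sample ID, paternal ID, maternal ID, sex, phenotype), with sex coded 1 = male, 2 = female and phenotype coded 1 = unaffected, 2 = affected. The family and sample names (fam1, dad, mom, kid) and the helper write_ped are placeholders for illustration, not part of GEMINI; please check the GEMINI documentation for the exact header and column conventions it accepts.

```python
import csv

# Hypothetical trio; sample IDs should match the sample names in the VCF header.
# Columns: family_id, sample_id, paternal_id, maternal_id, sex, phenotype
# Sex: 1 = male, 2 = female. Phenotype: 1 = unaffected, 2 = affected.
# A parent ID of "0" means that parent is not present in the file.
TRIO = [
    ("fam1", "dad", "0", "0", "1", "1"),
    ("fam1", "mom", "0", "0", "2", "1"),
    ("fam1", "kid", "dad", "mom", "1", "2"),
]


def write_ped(path, rows):
    """Write rows as a tab-delimited PED file and return the number of rows written."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return len(rows)
```

Calling write_ped("my.ped", TRIO) would produce a file that could then be passed to the load command with -p, as in the example above.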

Loading is computationally expensive, so the work can be distributed either among multiple CPUs on a single machine:

$ gemini load --cores 8 -v my.vcf -t snpEff my.gemini.db

or distributed on a computing cluster that leverages either SGE, LSF or Torque:

# LSF
$ gemini load --cores 128 --lsf-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db

# SGE
$ gemini load --cores 128 --sge-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db

# Torque
$ gemini load --cores 128 --torque-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db

Once loaded, one can begin exploring genetic variation using the "query" interface (see here for more details):

$ gemini query -q "select chrom, start, end, ref, alt from variants \
                  where is_lof = 1 \
                  and aaf >= 0.01" my.gemini.db

In particular, see the section on accessing and filtering upon sample genotype information.

For example, to select genotypes for a specific sample (sample1):

$ gemini query -q "select chrom, start, end, ref, alt, gts.sample1 from variants \
                  where is_lof = 1 \
                  and aaf >= 0.01" my.gemini.db

One can also apply genotype filters with the --gt-filter option. This returns only those variants that meet the genotype criteria you specify. Here is an example of a filter that enforces an autosomal recessive inheritance pattern. Note that these filters follow Python syntax.

$ gemini query -q "select chrom, start, end, ref, alt, gts.mom, gts.dad, gts.kid from variants \
                            where is_lof = 1 and aaf >= 0.01" \
               --gt-filter "gt_types.dad == HET and gt_types.mom == HET and gt_types.kid == HOM_ALT" \
               my.gemini.db
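To make the logic of this filter concrete, here is a small, self-contained Python sketch of the same test applied to a few made-up variants. It does not use GEMINI at all: the integer codes chosen for HOM_REF, HET, UNKNOWN and HOM_ALT, the helper is_autosomal_recessive, and the three example variants are all assumptions for illustration.

```python
# Illustrative genotype-type codes (assumed for this sketch).
HOM_REF, HET, UNKNOWN, HOM_ALT = 0, 1, 2, 3


def is_autosomal_recessive(gt_types):
    """Return True if both parents are heterozygous and the child is homozygous alternate.

    gt_types maps a sample name to its genotype-type code for one variant.
    """
    return (
        gt_types["dad"] == HET
        and gt_types["mom"] == HET
        and gt_types["kid"] == HOM_ALT
    )


# Three hypothetical variants, keyed by an arbitrary label.
variants = {
    "var1": {"dad": HET, "mom": HET, "kid": HOM_ALT},   # passes
    "var2": {"dad": HET, "mom": HOM_REF, "kid": HET},   # fails: mom is not HET
    "var3": {"dad": HET, "mom": HET, "kid": UNKNOWN},   # fails: kid genotype unknown
}

passing = [name for name, gts in sorted(variants.items()) if is_autosomal_recessive(gts)]
print(passing)
```

Only the first variant satisfies all three conditions, which is the behaviour the command-line filter expresses in a single Python-style expression.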

In addition, there are many built-in tools for conducting common analyses and finding variants that meet inheritance patterns that make sense for the phenotype you are studying. Please see here for more details.

Find de novo variants

$ gemini de_novo my.gemini.db

Find variants meeting an autosomal recessive inheritance pattern

$ gemini autosomal_recessive my.gemini.db

Find variants meeting an autosomal dominant inheritance pattern

$ gemini autosomal_dominant my.gemini.db

Lastly, we see GEMINI as a framework for researchers to develop their own new tools and methods. We see the GEMINI database as the "API", and given that SQLite databases are portable, the code you develop based upon the Python API will work on any GEMINI database.

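from gemini import GeminiQuery

# open a GEMINI database created with "gemini load"
gq = GeminiQuery("my.gemini.db")

# run a query, then iterate over the result rows
gq.run("select chrom, start, end from variants")
for row in gq:
    print(row)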
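Because a GEMINI database is an ordinary SQLite file, the same kind of query can in principle also be issued with Python's built-in sqlite3 module. The sketch below is self-contained: rather than opening a real GEMINI database, it builds a tiny in-memory table named variants with a handful of columns borrowed from the queries above. The real schema has many more columns, so treat the table definition and the three rows as invented test data.

```python
import sqlite3

# Toy stand-in for a GEMINI database; the real variants table is much wider.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants "
    '(chrom TEXT, start INTEGER, "end" INTEGER, ref TEXT, alt TEXT, '
    "is_lof INTEGER, aaf REAL)"
)

rows = [
    ("chr1", 100, 101, "A", "G", 1, 0.05),   # loss-of-function and common enough
    ("chr1", 200, 201, "C", "T", 0, 0.20),   # not loss-of-function
    ("chr2", 300, 301, "G", "A", 1, 0.001),  # loss-of-function but too rare
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Same shape as the command-line query: loss-of-function variants with aaf >= 0.01.
query = (
    "SELECT chrom, start, end, ref, alt FROM variants "
    "WHERE is_lof = 1 AND aaf >= 0.01"
)
hits = conn.execute(query).fetchall()
for row in hits:
    print(row)
```

With a real database one would pass its file name (for example my.gemini.db) to sqlite3.connect instead of ":memory:" and skip the table creation.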

We are constantly adding features. If there is something you would like to see added, please let us know (preferably using the mailing list).

vcf genome database variation tool • 5.9k views
ADD COMMENTlink modified 19 months ago by elsayedhegazy20 • written 6.0 years ago by Aaronquinlan10k

@Aaronquinlan: Thanks for sharing. This looks like an interesting tool; I need to try it...

ADD REPLYlink written 6.0 years ago by Rm7.8k

I am adding this to our homegrown LIMS!

ADD REPLYlink written 6.0 years ago by Istvan Albert ♦♦ 80k

Please let us know if you have any troubles or suggestions.

ADD REPLYlink written 6.0 years ago by Aaronquinlan10k

I am interested in using GEMINI to store and annotate CNVs. However, am I reading the documentation right in that only the most highly affected transcript would be stored? Perhaps I could still store the CNVs and some annotation in the SQLite DB and annotate for gene overlap on the fly...

ADD REPLYlink written 4.6 years ago by Robert Sicko570

The impact on other transcripts is stored in the variant_impacts table.
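As a rough illustration of what looking up per-transcript impacts could look like, here is a self-contained sketch using Python's sqlite3 module on a toy database. The column names (variant_id, gene, transcript, impact), the join key, and all of the rows are assumptions made for this example; the actual GEMINI schema should be checked in the documentation.

```python
import sqlite3

# Toy tables standing in for a GEMINI database (schema assumed for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE variants (variant_id INTEGER PRIMARY KEY, chrom TEXT, start INTEGER);
    CREATE TABLE variant_impacts (variant_id INTEGER, gene TEXT, transcript TEXT, impact TEXT);
    INSERT INTO variants VALUES (1, 'chr1', 100);
    INSERT INTO variants VALUES (2, 'chr2', 300);
    INSERT INTO variant_impacts VALUES (1, 'GENE_A', 'tx1', 'stop_gained');
    INSERT INTO variant_impacts VALUES (1, 'GENE_A', 'tx2', 'intron');
    INSERT INTO variant_impacts VALUES (2, 'GENE_B', 'tx3', 'missense');
    """
)

# List every transcript-level impact recorded for variant 1.
query = (
    "SELECT v.chrom, v.start, i.gene, i.transcript, i.impact "
    "FROM variants v JOIN variant_impacts i ON v.variant_id = i.variant_id "
    "WHERE v.variant_id = 1 ORDER BY i.transcript"
)
impacts = conn.execute(query).fetchall()
for row in impacts:
    print(row)
```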

ADD REPLYlink written 4.4 years ago by Aaronquinlan10k

How well do you expect GEMINI to handle population-level (1000s of gVCFs) human WES or WGS data?

ADD REPLYlink written 4.4 years ago by Rm7.8k

The input to GEMINI is a single VCF, which can be created by combining your 1000s of gVCFs. Currently, it will perform fairly well for exome studies of 1000s of samples, but not too well for whole genomes. That said, we are working on a new version that will easily scale to 1000s of samples for WGS.

ADD REPLYlink written 4.4 years ago by Aaronquinlan10k

Thanks @Aaronquinlan: I will test it with ~8000 combined gVCFs from WES and will update how it goes with the current version. 

ADD REPLYlink written 4.4 years ago by Rm7.8k
5.8 years ago by
Aaronquinlan10k
United States
Aaronquinlan10k wrote:

The manuscript for GEMINI is available at: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003153
A high-level description of it is also given in this video: http://www.youtube.com/watch?feature=player_embedded&v=p-UWmDG6yj4

ADD COMMENTlink written 5.8 years ago by Aaronquinlan10k

Hi Sir,

Can you please tell me how I can call the GEMINI API from a Windows machine, starting from scratch? For example, is there an API equivalent of the 'gemini load' command, and an API for querying the database?

Thanks,

Nilesh

ADD REPLYlink written 3.8 years ago by nbhbiotech.hake0

Please don't ask questions in the comment section of a post. Post a new question if you have one.

ADD REPLYlink written 3.8 years ago by Istvan Albert ♦♦ 80k
6.0 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

Just an idea: there is a user-friendly query language called HTSQL that might be well suited to opening up GEMINI to less technically inclined people.

HTSQL is a comprehensive navigational query language for relational databases.

ADD COMMENTlink written 6.0 years ago by Istvan Albert ♦♦ 80k

Looks interesting, though it seems geared towards making SQL less "hard". I see SQL as one of the most intuitive languages around...most biologists that I know who lack programming skills find SQL easy to understand. Have you seen otherwise?

ADD REPLYlink written 6.0 years ago by Aaronquinlan10k

Simple selects have an easy syntax that would be hard to improve on. But once you have joins and grouping, it gets very unforgiving and mistakes are hard to spot. I have quite a hard time building these myself if I haven't used SQL in a while. Usually I feel that I need to retrain myself after a few months of not doing SQL.

Compare the two, the HTSQL:

/department{name, max(course.credits)}

versus direct SQL:

SELECT "department"."name",
       "course"."max"
FROM "ad"."department"
     LEFT OUTER JOIN (SELECT MAX("course"."credits") AS "max",
                             "course"."department_code"
                      FROM "ad"."course"
                      GROUP BY 2) AS "course"
                     ON ("department"."code" = "course"."department_code")
ORDER BY "department"."code" ASC
LIMIT 10000
ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Istvan Albert ♦♦ 80k
6.0 years ago by
Melbourne
Roman Valls Guimerà510 wrote:

Are there any plans to abstract the job queueing interface with something like DRMAA? The *-queue flags seem a bit redundant.

http://www.drmaa.org/

ADD COMMENTlink written 6.0 years ago by Roman Valls Guimerà510

Thanks. I was unaware of this. We are currently using IPython-parallel to handle the distributed computing. DRMAA may be an option but I need to spend some time reading up on it.

ADD REPLYlink written 6.0 years ago by Aaronquinlan10k
5.6 years ago by
Melbourne
Roman Valls Guimerà510 wrote:

I wonder if there are plans to support coverage (if it is not supported somehow already), as Chanjo does:

https://chanjo.readthedocs.org/en/latest/

ADD COMMENTlink written 5.6 years ago by Roman Valls Guimerà510

Chanjo looks very interesting...we will look into it.

ADD REPLYlink written 5.6 years ago by Aaronquinlan10k
19 months ago by
elsayedhegazy20 wrote:

Is there any way to make GEMINI work with hg38?

ADD COMMENTlink written 19 months ago by elsayedhegazy20