Erik Garrison received his Bachelor's degree in 2006, majoring in Social Studies at Harvard University. He then went on to work as a Software Engineer on the One Laptop Per Child project and, later, as a Research Assistant analyzing Wikipedia data dumps. Subsequently he was hired as a Research Assistant/Contractor at the Harvard Medical School, where he designed, wrote, and tested data acquisition and system control software for the "Polonator" open-source DNA sequencing device.
Between 2010 and 2014 Erik Garrison worked as a Research Associate at Boston College, where he wrote FreeBayes, the first haplotype- and graph-based variant detection method for short-read sequencing data. While at Boston College he also became a major contributor to the 1000 Genomes Project in the areas of variant detection, data integration, and functional interpretation.
Erik Garrison of FreeBayes
How did you get started in bioinformatics?
I grew up in Kentucky, in a small-town environment where community was the most important thing. When I went to college, I decided to study the social sciences, thinking that this would provide me with a broad perspective for a future life in public service. What happened was quite different: I found myself attracted to open source initiatives, and focused my thesis research on them. At the same time, I developed my capacity as a quantitative researcher and began working with very large (well, only multi-terabyte) data sets produced by projects like Wikipedia.
The skills and connections I collected in my last years at university provided me the opportunity to work on an open source project in genomics: the Polonator sequencing platform that was developed in George Church's lab. Although the project did not take off, I learned an incredible amount about genomics and sequencing through this process, which left me with a head start in understanding these systems. I put this to use several years later when I started working with Gabor Marth at Boston College. Gabor put me to work maintaining the variant detection and genotyping method he'd developed (PolyBayes -> GigaBayes -> BamBayes), which I developed into freebayes.
The subtext to the above is that I work in this space because I see it as a place where people working in the open can be of enormous benefit to many others. This is a big part of my attraction to the 1000 Genomes Project. I'm also proud to support a number of free software projects, like freebayes, that provide users with a lot of flexibility and support a really diverse array of research.
What hardware do you use?
I use a Lenovo X1 Carbon and a Samsung Galaxy Note 3 as my primary interfaces for almost all of my work.
I use a TypeMatrix keyboard and I type dvorak.
What is your text editor?
I use both vim and emacs. If it's anything more than a few quick edits or a read through some source code, I'll use emacs.
What software do you use for your work?
Linux, zsh, mosh, tmux, emacs, git, R, LaTeX.

I run Linux on my laptop, which provides a great environment for developing bioinformatic tools. That said, I've tended to work remotely on servers where both code and very large data sets can live in harmony. To support this, I maintain mosh (a kind of fault- and latency-tolerant ssh-over-UDP) connections with several servers so that I can just open my laptop and begin hacking without restarting my connections.
I keep tmux running for added redundancy and fault tolerance, even locally. It's an easy way to build up a persistent IDE from panels of shells, programs, and text editors. I also write a huge amount of email, mostly via Gmail. And because I can't easily run MS Word on my laptop, I use Google Docs/Drive for collaborative document editing.
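A minimal sketch of the reconnect pattern described above, as a shell rc-file fragment. The host name "myserver" and session name "main" are placeholders, not details from the interview:

```shell
# Shell alias (e.g. in ~/.zshrc) for resuming a persistent remote
# session in one step: mosh tolerates sleeps and network changes,
# and tmux's -A flag attaches to the named session, creating it
# first if it doesn't exist yet.
alias work='mosh myserver -- tmux new-session -A -s main'
```

With an alias like this, opening the laptop and typing one word drops you back into the same panes of shells and editors you left running.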
What do you use to create plots and charts?
Mostly R, but I almost exclusively use ggplot2. R is great, but it can be hard to work with large-scale data in it. (Can someone develop a JIT, multithreading compiler for R so my plots will finish before I'm motivated to get up and get a coffee?)
What do you consider the best language to do bioinformatics with?
I think that dataflow-oriented languages like shell (bash, zsh, etc.) are the best platform for bioinformatics. Provided your software can read and write streams of data, it can be used in shell-based scripts to build analysis pipelines. I don't think that the actual analysis algorithms should be written in dynamic languages, but ensuring that methods can be used in dynamic environments allows for the rapid composition of analyses and interactive manipulation of data.
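The stream-composition idea above can be sketched with standard Unix tools alone. The three-column "chrom pos qual" records here are invented for the demo; any real tool that reads and writes streams could slot into the same pipe:

```shell
#!/bin/sh
# A toy dataflow pipeline: each stage reads a stream and writes a
# stream, so stages compose freely with pipes.
printf '%s\n' \
    'chr1 100 12' \
    'chr1 250 45' \
    'chr2 90 3' \
    'chr2 400 60' |
awk '$3 >= 10' |   # filter: keep records with quality >= 10
sort -k3,3nr |     # rank: highest quality first
head -n 2          # report: the top two records
```

Swapping any stage (say, replacing the awk filter with a different predicate) changes the analysis without touching the rest of the pipeline, which is the composability the shell buys you.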
Where things need to go fast, I write in C++. Today, it offers the best tradeoff between performance (the best) and expressiveness (I need very few lines of C++ to achieve a given end). However, I prefer to build large systems out of small pieces written in many languages. I find it somewhat difficult to do this in C++, and so I turn to more dynamic systems when I want to build and run analyses.
What bioinformatics tools/software do not get enough recognition?
- The Scalpel assembler, from Giuseppe Narzisi: A lot of groups have developed indel calling strategies based on local assembly, but these methods are relatively error-prone. The Scalpel paper includes a careful, validation-driven examination of issues related to detecting longer indels, and demonstrates how the method itself does a good job of managing many of these issues. In addition, it's a really well-written piece of software, and its assembly routines are modular and easy to interface with. (http://scalpel.sourceforge.net/)
- vt is a great set of tools for variant discovery and manipulation from Adrian Tan. I find myself constantly referring people to it because it includes functions to normalize variant representation in VCF, but there is a lot more there. It's a high-quality, full-featured library for working with variation information, and should be in every bioinformatic hacker's toolbox. (https://github.com/atks/vt)
- dat is a revision control system for data. It's not specific to bioinformatics, which is one of the reasons that I think more people in biology should be investigating it. Bioinformatics is not a large enough field to support biology-specific versions of everything we need! At this point I wouldn't recommend it as a solution, as it's still in alpha, but it is the first tool I've found that at least intends to resolve one of the most basic problems I've encountered in my work in biology. Today it is easy to control the history of software, but not of data. I suspect I'm not the only one on Biostars who has experienced the pain that arises when you can't reliably reconstruct the history of a given data set. (http://dat-data.com/)
- I'm starting to work with node.js, which is also stream-oriented and has an incredibly extensive package repository (npm). There is an active project (bionode) exploring how to do bioinformatics in node, which integrates nicely with dat. I hope more people find their way to the project as web-based bioinformatics becomes more important. (http://bionode.io/)
See all posts in this series: https://www.biostars.org/t/uses-this/
To be notified of a new post in the series follow the first post: Jim Robinson of the Integrative Genomics Viewer (IGV) uses this