Cole Trapnell is an Assistant Professor in the Department of Genome Sciences at the University of Washington, where he directs a research lab that studies cell differentiation, reprogramming, and other transitions between stable or metastable cellular states.
Dr. Trapnell received his bachelor's degree as well as his PhD in Computer Science from the University of Maryland, where he was co-advised by Steven Salzberg and Lior Pachter. He then trained as a postdoc in John Rinn's lab at Harvard's Department of Stem Cell and Regenerative Biology.
During his PhD, Dr. Trapnell wrote TopHat (a fast splice junction mapper) and Cufflinks/CuffDiff (a transcript assembler and differential expression tool), the two software packages that ushered in the era of widely accessible RNA-seq data analysis. Before these tools, RNA-seq data analysis was a mysterious and elusive process. TopHat changed that. It offered a well-documented manual, downloadable indices, databases and annotation files, binary executables for multiple platforms, as well as many handy utilities.
It provided a complete solution in a standalone package, in which running an entire pipeline to detect differentially expressed genes across multiple replicates and conditions took little more than a couple of simple commands. In the history of bioinformatics, few tools have had such a transformative effect on an entire scientific field.
Cole Trapnell of TopHat
How did you get started with programming/bioinformatics?
Before I got into science, I was a professional software engineer.
I started working as a programmer in high school for the US Army Research Lab. There, I learned to love Linux, and not for the reasons you might think. I was working on control software for this battlefield robot platform. Which sounds awesome, but really it meant I was trying to make this large, trashcan-shaped thing (no lasers, sorry) go forward a precise distance. This was before wireless networking had really become pervasive, so I had the robot opened up, and the internal computer’s ethernet card connected to the wall jack by a cable stretched through the air over a distance of about 30 feet. This cable somehow (that is, through my carelessness) became draped over one of the large robot wheels. Not noticing this, I hit “go” to test my program, which caused the wheel to turn, and thus caught the ethernet cable in the large treads of the tire (battlefield-grade robot, remember?). The tire, which was moving fast enough to send the robot forward at 1 meter per second, which is fast, immediately snapped up the slack in the cable, at which point it would either be ripped out of the wall or out of the computer. Seeing this, I grabbed the cable to try to stop either from happening. I’m not sure what I expected here, because physics was strongly suggesting at this point that something had to give. That something turned out to be the entire ethernet card of the computer, which was only lightly screwed into the machine. The card flew right through my hands and across the room. Linux apparently has no problem with an idiot teenager yanking its network card out, because it kept running with no problem whatsoever. I stopped my program, plugged the card back in, and kept working without even having to reset the machine. No harm done, amazingly. Now, THAT is robust software.
When I was an undergrad, I left the ARL and started working for my uncle, who had started a foreign currency trading shop. He and I wrote all the software for the company (it was a small company), which mainly consisted of one of those programs you see running on like 6 monitors on a trading floor with charts all over the place. This was a cool job because the programming was really hard: the screen (and all of the underlying analysis) had to update as soon as possible after a new trade came in from the exchange, so everything had to be really low-latency. We don’t usually think about latency in programming outside of embedded systems or real-time instrument control. I learned a great deal about writing fast code.
After working on the finance stuff, I took an engineering job in California, but decided pretty soon after that I wanted to become a scientist instead, and went back to get my PhD.
What hardware do you use?
I use a Mac Pro for my desktop and a MacBook Pro on the road. My lab has a cluster with 7 nodes, each with 64 cores and 512GB of RAM. The cluster’s brand new and hasn’t seen much action yet, but I’m sure that will change.
What is your text editor?
I like Xcode for C++ development (the debugger interface is the best one I’ve seen short of the Visual C++ debugger, which is amazing), and TextMate for everything else.
What software do you use for your work?
I mostly analyze RNA-Seq data, and now mostly single-cell. I use TopHat, Cufflinks, and CummeRbund for RNA-Seq, Monocle for single-cell RNA-Seq, and Bowtie for general read mapping. UCSC for most genome browsing, IGV to look at BAMs. Bedtools, and lots of awk and python scripts for tearing through the various files Bedtools produces. I usually write a gross pile of R code at the end of each project to tie it all together, but it’s getting less gross now that I’m switching over to using R markdown/knitr.
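A minimal sketch of the kind of "tearing through" script mentioned above, assuming a standard three-plus-column BED file (the data and the per-chromosome summary chosen here are purely illustrative, not a specific script from the lab):

```python
import io
from collections import defaultdict

def total_bases_per_chrom(bed_handle):
    """Sum feature lengths (end - start) per chromosome from BED lines."""
    totals = defaultdict(int)
    for line in bed_handle:
        fields = line.rstrip("\n").split("\t")
        # Skip headers/track lines and anything without the 3 required columns
        if line.startswith(("#", "track", "browser")) or len(fields) < 3:
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        totals[chrom] += end - start
    return dict(totals)

# Tiny in-memory example; in practice you'd pass open("some_output.bed")
bed = io.StringIO("chr1\t100\t200\tpeak1\nchr1\t500\t650\tpeak2\nchr2\t0\t50\tpeak3\n")
print(total_bases_per_chrom(bed))  # → {'chr1': 250, 'chr2': 50}
```

The same one-liner in awk would be roughly `awk '{len[$1] += $3 - $2} END {for (c in len) print c, len[c]}'`, which is the sort of quick summarization these tools' tab-delimited outputs lend themselves to.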
What do you use to create plots and charts?
ggplot2, almost exclusively, but I’ve got my eye on ggvis.
What do you consider the best language to do bioinformatics with?
Hmm, this is a tough one. Recent history strongly suggests that if a program is to be widely used, it needs to be both fast and consume reasonable amounts of RAM. I shoot for a footprint of less than 4GB per instance, though this isn’t always possible (at least possible for me). The rationale for that is that although it’s easy to get a machine these days with far more than that, many cluster installations have nodes that have as little as 16 GB. Aiming for 4 keeps a program well within the range that you can parallelize many instances of it across a cluster, which helps when you are, say, mapping 100 FASTQ files at once.
Some languages are faster than others. Usually, the more a language allows you to defer decisions until runtime through dynamically generated code as a core language feature, the slower it is. C++, for example, has limited support for this kind of thing, whereas R has loads. Some languages also give you little or no control over how much memory is consumed at peak usage or how that memory is organized and accessed. This creates huge problems for placing upper bounds on how much memory your program will consume.
C and C++ are among the few languages that are very fast and give you total control over memory. The tradeoffs are that writing new code in C++ can be slow because there’s a lot of detail, and there are far fewer bioinformatics libraries than for R or Python, so you have to write more from scratch.
So I guess I would say when performance matters, C++ is my language of choice. When it doesn’t as much, I use whatever lets me write (and thus test) the least amount of code, and that usually means R these days.
What bioinformatics tools/software do not get enough recognition?
There are many, many tools that solve hard problems well and fall into this category. Just a couple of examples: I have found the chromatin state maps produced by programs like ChromHMM and Segway to be extremely helpful.
I also find that the Fast Statistical Aligner (FSA) gives me insanely good results on hard alignment problems. And while they aren’t specifically for bioinformatics, I would have to say that awk and plyr/dplyr are hands down some of the most useful pieces of science software ever written.
See all posts in this series: https://www.biostars.org/t/uses-this/
To be notified of a new post in the series, follow the first post: Jim Robinson of the Integrative Genomics Viewer (IGV) uses this