File Chameleon, easily transform Ensembl FTP files

written 1 day ago by Ensembl Blog

Transforming file formats has always been a troublesome issue in bioinformatics because of the numerous standards and slight eccentricities in formatting required by some software packages. How many times have you needed to transform chromosome names between 1,2,3 and chr1, […]
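As an illustration of the chromosome-naming problem mentioned above, here is a minimal Python sketch. It is not File Chameleon's code or interface; the function names and the `PRIMARY` set are made up for this example, and it assumes human primary chromosomes only (1-22, X, Y, plus the mitochondrion, which Ensembl calls MT and UCSC calls chrM). Any other sequence name is passed through unchanged.

```python
# Hypothetical helper names; assumes human primary chromosomes only.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def to_ucsc(name: str) -> str:
    """Convert an Ensembl-style name ("1", "X", "MT") to UCSC style ("chr1", "chrX", "chrM")."""
    if name == "MT":
        return "chrM"
    if name in PRIMARY:
        return "chr" + name
    return name  # scaffolds, patches, or already-converted names are left alone


def to_ensembl(name: str) -> str:
    """Convert a UCSC-style name ("chr1", "chrM") to Ensembl style ("1", "MT")."""
    if name == "chrM":
        return "MT"
    if name.startswith("chr") and name[3:] in PRIMARY:
        return name[3:]
    return name


def rename_first_column(line: str, convert) -> str:
    """Rewrite the chromosome in the first tab-separated field; skip headers and blank lines."""
    if line.startswith("#") or not line.strip():
        return line
    chrom, sep, rest = line.partition("\t")
    return convert(chrom) + sep + rest
```

For a BED, GFF or VCF file, `rename_first_column(line, to_ucsc)` could be applied to each line in turn, since all three formats keep the sequence name in the first column.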

Bio-Rad Sips Up RainDance

written 2 days ago by Omics! Omics! by Keith Robinson

Monday evening brought news that Bio-Rad has further consolidated its grip on the droplet microfluidics space by acquiring RainDance Technologies for an undisclosed price. Bio-Rad had previously acquired droplet digital PCR company QuantaLife back in October of 2011 and targeted sequencing company GnuBio in April of 2014. While the droplet digital PCR has been marketed for many years now, the GnuBio effort had gone relatively quiet since the acquisition. However, Bio-Rad announced at the JP Morgan conference that this technology will be launched as OncoDrop late this year.

A citation is not a citation is not a citation

written 2 days ago by Bits of DNA by Lior Pachter

Three years ago Nicolas Bray and I published a post-publication review of the paper “Network link prediction by global silencing of indirect correlations” (Barzel and Barabási, Nature Biotechnology, 2013). Despite our less than positive review of the work, the paper has gone on to garner 95 citations since its publication (source: Google Scholar). In fact, in just this past […]

Computational postdoc opening at UC Davis!

written 2 days ago by Living in an Ivory Basement by Titus Brown

We are currently soliciting applications for computational postdoctoral fellows to undertake exciting projects in computational biology/bioinformatics jointly supervised by Dr. Titus Brown and Dr. Fereydoun Hormozdiari at UC Davis. UC Davis is a world class research institution with a strong genomics faculty. In addition to being part of Dr. Brown and Dr. Hormozdiari's labs, the postdoc will be able to participate in Genome Center activities. Potential collaborators include Megan Dennis, Alex Norde, and Paul and Randi Hagerman. UC Davis is close to the Bay Area and there will be opportunities to connect and collaborate with researchers at Berkeley, Stanford, and UCSF as well. Davis, CA is an excellent place to live with good food, great schools, nice weather, non-Bay-Area housing prices, and a bike-friendly culture. --- The successful candidate will undertake computational method and tool development for better understanding the contribution of genetic variation (especially structural variation) to changes in genome structure. In collaboration with the members of both labs, the postdoctoral candidate will also be building models for predicting the changes in gene expression based on variants (especially CNV) and performing a comparative study of genome structures in multiple tissues/samples using HiC data. This opportunity requires developing novel computational algorithms and machine learning methods to solve emerging biological problems. The technical expertise needed includes a strong computational background to develop novel combinatorial, machine learning (ML) or statistical inference algorithms, with strong programming capabilities and a general understanding of the concepts in genomics and genetics. Candidates are guaranteed funding ...

Categorizing 400,000 microbial genome shotgun data sets from the SRA

written 3 days ago by Living in an Ivory Basement by Titus Brown

This is another blog post on MinHash sketches; see also: Applying MinHash to cluster RNAseq samples MinHash signatures as ways to find samples, and collaborators? Efficiently searching MinHash Sketch collections Quickly searching all the microbial genomes, mark 2 - now with archaea, phage, fungi, and protists! What metadata should we put in MinHash Sketch signatures? Contents How do you dance lightly across the surface of 400,000 data sets? Categorizing 400,000 sourmash signatures... quickly! Results! What are the results?! What next? Backing up -- why would you want to do any of this? A few months ago I was at the Woods Hole MBL Microbial Diversity course, and I ran across Mihai Pop, who was teaching at the STAMPS Microbial Ecology course. Mihai is an old friend who shares my interest in microbial genomes and assembly and other such stuff, and during our conversation he pointed out that there were many unassembled microbial genomes sitting in the Sequence Read Archive. The NCBI Sequence Read Archive is one of the primary archives for biological sequencing data, but it generally holds only the raw sequencing data; assemblies and analysis products go elsewhere. It's also largely unsearchable by sequence: you can search an individual data set with BLAST, I think, but you can't search multiple data sets (because each data set is large, and the search functionality to handle it doesn't really exist). There have been some attempts to make it searchable, including most notably Brad Solomon and Carl Kingsford's Sequence Bloom Tree paper ...
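For readers who have not met MinHash sketches before, the idea behind the posts listed above can be sketched in a few lines of Python. This is a toy bottom-k sketch and not sourmash's implementation: the k-mer size, sketch size, and the use of MD5 as the hash are assumptions chosen for illustration, and reverse complements are ignored.

```python
import hashlib


def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer with a stable hash (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")


def sketch(seq: str, k: int = 5, size: int = 50) -> set:
    """Bottom-k sketch: keep only the `size` smallest distinct k-mer hashes."""
    return set(sorted({hash_kmer(km) for km in kmers(seq, k)})[:size])


def jaccard_estimate(a: set, b: set, size: int = 50) -> float:
    """Estimate Jaccard similarity of the underlying k-mer sets from two bottom-k sketches."""
    merged = sorted(a | b)[:size]  # bottom-k sketch of the union
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)
```

Because each sketch has a fixed size no matter how large the data set is, comparing two samples costs about the same for a small genome as for a large one, which is what makes scanning hundreds of thousands of data sets plausible.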

Creating a custom GATK Walker (GATK 3.6) : my notebook

written 5 days ago by YOKOFAKUN by Pierre Lindenbaum

This is my notebook for creating a custom engine in GATK.
Description: I want to read a VCF file and to get a table of category/count. Something like this:
HAVE_ID  TYPE   COUNT
YES      SNP    123
NO       SNP    3
NO       INDEL  13
Class Category: I create a class Category describing each row in the table. It's just a List of Strings: static class Category implements Comparable {
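The category/count table described above can be sketched without GATK at all. The following Python is not the walker from the notebook and uses no GATK API. It assumes plain tab-separated VCF text and a deliberately crude rule, labeled in the code, that a record is a SNP when REF and every ALT allele are a single base; everything else (including symbolic alleles) is counted as INDEL.

```python
from collections import Counter


def categorize(vcf_line: str):
    """Return (HAVE_ID, TYPE) for one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    variant_id, ref, alt = fields[2], fields[3], fields[4]
    have_id = "NO" if variant_id in (".", "") else "YES"
    # Simplifying assumption: single-base REF and ALT alleles mean SNP; all else is INDEL.
    is_snp = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
    return have_id, "SNP" if is_snp else "INDEL"


def count_categories(lines):
    """Count (HAVE_ID, TYPE) pairs over an iterable of VCF lines, skipping headers."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[categorize(line)] += 1
    return counts
```

Passing an open file handle to `count_categories` and printing the counter's items sorted by key would give rows in the same HAVE_ID / TYPE / COUNT layout.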

RStudio Conference 2017 Recap

written 6 days ago by Getting Genetics Done by Stephen Turner

The first ever RStudio conference was held January 11-14, 2017 in Orlando, FL. For anyone else like me who spends hours each working day staring into an RStudio session, the conference was truly excellent. The speaker lineup was diverse and covered lots of areas related to development in R, including the tidyverse, the RStudio IDE, Shiny, htmlwidgets, and authoring with RMarkdown. This is not a complete list by any means; with split sessions I could only go to half the talks at most. Here are some noncomprehensive notes and links to slides and resources for some of the awesome things people are doing with R and RStudio that I learned about at the RStudio Conference. Hadley Wickham kicked off the meeting with a keynote on doing data science in R. The talk focused on the tidyverse, and the notion of splitting functions into commands that do something, as compared to queries that calculate something, and how it’s generally a good idea to keep these different functionalities contained in their own separate functions. (Contrast this with things like lm, which both computes values and does things, like printing those values to the screen, making the output difficult to capture; see broom.) I asked Hadley after his talk about strategies to reduce issues getting Bioconductor data structures to play nicely with tidyverse tools. Within minutes David Robinson released a new feature in the fuzzyjoin package that leverages IRanges within this tidyverse-friendly package for efficiently doing things like joining on genomic intervals. Another #rstudioconf-inspired addition to ...
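The command/query split mentioned in the keynote summary can be illustrated in a few lines. This sketch is in Python, not R, and is not taken from the talk; the function names are invented for the example.

```python
def mean_width(widths):
    """Query: computes and returns a value, with no side effects."""
    return sum(widths) / len(widths)


def report_mean(widths, out=print):
    """Command: performs an action (writing a message) and returns nothing."""
    out(f"mean width: {mean_width(widths):.2f}")
```

Keeping the two apart means the computed value is easy to capture and reuse (`m = mean_width(values)`), while the printing can be swapped out or silenced by passing a different `out` callable.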

AGBT 2017 Agenda is Here

written 6 days ago by Next Gen Seek

Advances in Genome Biology and Technology (AGBT) General Meeting, one of the premier events focusing on genomic technology, which starts on 13th February, has announced the full agenda for the meeting. The schedule looks as exciting as usual, and here are some of the talks to look forward to. ANDREW ADEY, Oregon […]

Exome or Whole-genome Sequencing for Mendelian Disorders

written 8 days ago by MassGenomics by Dan Koboldt

Exome sequencing has undeniably transformed the study of rare inherited disorders, enabling the rapid identification of hundreds of new disease genes in the past few years and spurring the adoption of clinical exome sequencing as a frontline diagnostic tool. That’s great news. Hooray for the exome! Is it a fantastic discovery tool? Absolutely. But it’s […]

Illumina Unveils HiSeq Successor NovaSeq

written 10 days ago by Omics! Omics! by Keith Robinson

At today's J.P. Morgan Healthcare Conference Illumina made a number of small announcements -- some new partnerships, Firefly on track for launch later this year, launch of the single cell workflow partnered with Bio-Rad. Then CEO Francis deSouza dropped the big news: a new high-end sequencer architecture to ultimately replace all of the HiSeq instruments. It sounds like an interesting evolution of the Illumina product line, but unfortunately too many headlines and tweets have focused on a distant goal of $100 human genomes. Worse, not only did some commentators misconstrue the announcement as delivering on $100 genomes, but some also touted a sequencing speed of one hour for a genome which isn't remotely true.

Illumina announces new sequencers: NovaSeq Series

written 10 days ago by Next Gen Seek

Illumina announced new sequencers, NovaSeq 5000 and 6000, that could potentially reduce the cost of human genome sequencing to $100 in the future. The NovaSeq sequencing systems, which are built from the ground up, offer scalability and flexibility for any type of sequencing, either large scale or targeted sequencing. NovaSeq 5000 and 6000 NovaSeq 5000 […]

HiSeq move over, here comes Nova! A first look at Illumina NovaSeq

written 10 days ago by Opinionomics by Mick Watson

Illumina have announced NovaSeq, an entirely new sequencing system that completely disrupts their existing HiSeq user-base. In my opinion, if you have a HiSeq and you are NOT currently engaged in planning to migrate to NovaSeq, then you will be out of business in 1-2 years' time. It’s not quite the death knell for HiSeqs, […]

How I learned to stop worrying and love the coming archivability crisis in scientific software

written 10 days ago by Living in an Ivory Basement by Titus Brown

Note: This is the fifth post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future. This post was put together after the event and benefited greatly from conversations with Victoria Stodden, Yolanda Gil, Monya Baker, Gail Peretsman-Clement, and Kristin Antelman! Archivability is a disaster in the software world In The talk I didn't give at Caltech, I pointed out that our current software stack is connected, brittle, and non-repeatable. This affects our ability to store and recover science from archives. Basically, in our lab, we find that our executable papers routinely break over time because of minor changes to dependent packages or libraries. Yes, the software stack is constantly changing! Why?! Let me back up -- Our analysis routines usually depend on an extensive hierarchy of packages. We may be writing bespoke scripts on top of our own library, but those scripts and that library sit on top of other libraries, which in turn use the Python language, the GNU ecosystem, Linux, and a bunch of firmware. All of this rests on a not-always-that-sane hardware implementation that occasionally throws up errors because x was compiled on y processor but is running on z processor. We've had every part of this stack cause problems for us. Three examples: many current repeatability stacks are starting to rely on Docker. But Docker changes routinely, and it's not at all clear that the images you save today will work tomorrow. Dockerfiles (which provide the instructions for ...

The talk I didn't give at Caltech (Paper of the Future)

written 10 days ago by Living in an Ivory Basement by Titus Brown

Note: This is the fourth post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future. This is an outline of the talk I didn't give at Caltech, because I decided that Victoria Stodden and Yolanda Gil were going to cover most of it and I would rather talk about a random collection of things that they might not talk about. (I think I was 7 for 10 on that. ;) This is in outline-y form, but I think it's fairly understandable. Ask questions in the comments if not! What will the paper of the future look like? A few assertions about the scientific paper of the future: The paper of the future will be open - open access, open data, and open source. The paper of the future will be highly repeatable. The paper of the future will be linked. The paper of the future will not depend on expensive infrastructure. The paper of the future will be commonplace. The paper of the future will be archivable (or will it? Read on.) What's our experience with the paper of the future been? My lab (and many, many others) have been doing things like: Automating the entire analysis from raw data to conclusion. Publishing data narratives and notebooks. Using version control for paper and data notebook and source code. Anointing data sets with DOIs. Posting virtual environments & execution specifications for papers. We've been doing parts of this for many years, and while ...

Pondering What Is Lost In Teaching Translation

written 11 days ago by Omics! Omics! by Keith Robinson

I'm good at acquiring distractions, and a relatively new one is Quora. This site allows users to ask questions which are then answered by members of the community. I lurk in a number of fields, but have answered a few questions related to genomics and related fields of biology. Tackling a question last night required re-learning some details I was disappointed I had forgotten. In researching to regain that knowledge, I skimmed a number of study guides online, which leads to this post.

Three lessons from peer coaching

written 11 days ago by The Grand Locus

For a little more than a year, my colleagues and I have been organizing peer coaching sessions for junior group leaders. A typical session consists of four to six of us, and we meet for one morning to discuss the most pressing issues. After a start-up training and some trial and error, we settled on a group coaching method that gave the best result. To give an idea, the “coachee” tells the chairman what he/she wants to solve, then follows a discussion where he/she explains the facts to the coaches, who ask as many questions as possible. Then the coaches analyze the situation, suggest solutions and make comments while the coachee has to remain silent and listen. Finally, the coachee summarizes what he/she heard and what steps he/she will take. With this exercise, we learned a great deal about how to organize such peer coaching sessions in academia and how to make the best of them, but this is not what this post is about. Instead, I would like to share more important lessons I have learned about working together and using the group as support and source of motivation. I you...

#JPM17 Genomics and Synthetic Biology Companies

written 12 days ago by Omics! Omics! by Keith Robinson

With the 2017 J.P. Morgan Conference in Healthcare (#JPM17) starting Monday, I and others have engaged in early reporting or speculation. I've tried to compile a list of presenting companies in the genomics, informatics and synthetic biology tool spaces, but these were filtered quickly from a long list of presenting companies so I may have missed some -- please leave comments and I can add them. Also, some of the big conglomerates could speak on these topics but might ignore them, so no promises. For example, Roche has their pharmaceutical CEO speaking, so we may not hear anything about the PacBio breakup or Genia lawsuit. All times are Pacific Standard Time and are from J.P. Morgan, though I've converted to 24-hour time (hopefully successfully!). You may need to register with J.P. Morgan to follow the links I've provided and access the webcasts when they are available.

Topics and concepts I'm excited about (Paper of the Future)

written 12 days ago by Living in an Ivory Basement by Titus Brown

Note: This is the third post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future. I've been struggling to put together an interesting talk for the workshop, and last night Gail Clement (our host, @Repositorian) and Justin Bois helped me convince myself (using red wine) that I should do something other than present my vision for #futurepaper. So, instead, here is a set of things that I'm pretty excited about in the world of scholarly communication! I've definitely left off a few, and I'd welcome pointers and commentary to things I've missed; please comment! 1. The wonderful ongoing discussion around significance and reproducibility. In addition to Victoria Stodden, Brian Nosek and John Ioannidis have been leaders in banging various drums (and executing various research agenda) that are showing us that we're not thinking very clearly about issues fundamental to science. For me, the blog post that really blew my mind was Dorothy Bishop's summary of the situation in psychology. To quote: Nosek et al have demonstrated that much work in psychology is not reproducible in the everyday sense that if I try to repeat your experiment I can be confident of getting the same effect. Implicit in the critique by Gilbert et al is the notion that many studies are focused on effects that are both small and fragile, and so it is to be expected they will be hard to reproduce. They may well be right, but if so, the solution ...

Data implies software.

written 13 days ago by Living in an Ivory Basement by Titus Brown

Note: This is the second post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future. An important yet rarely articulated assumption of a lot of my work in biological data analysis is that data implies software: it's not much good gathering data if you don't have the ability to analyze it. For some data, spreadsheet software is good enough. This was the situation molecular biology was in up until the early 2000s - sure, we'd get numbers from instruments and sequences from sequencers, but they'd all fit pretty handily in whatever software we had lying around. Once numerical data sets get big enough -- e.g. I did approximately 50,000 qPCRs in my last two years of grad school, which was unpleasant to handle in Excel -- we need to invest in software like R or Python, which can do bulk and batch processing of the data. Software like OpenRefine can also help with "manual" cleaning and rationalization of the data. But this requires skills that are still relatively specialized. For other data, we need custom software built specifically for that data type. This is true of sequence analysis, where most of my work is focused: when you get 200m DNA sequences, each of length 150 bp, there's no simple, effective way to query or summarize that using general computational tools. We need specialized code to parse, summarize, explore, and investigate these data sets. Using this code doesn't necessarily require serious programming knowledge, ...

Two Pore Guys Previews Handheld Nanopore Analyte Sensor Ahead of J.P. Morgan Conference

written 14 days ago by Omics! Omics! by Keith Robinson

2017 is certainly shaping up to be a big year for nanopore news. I touched on Oxford Nanopore's very full plate in my speculation about sequencing platforms and we already know of two different legal actions which will be progressing, PacBio vs. Oxford Nanopore and University of California vs. Genia. James Hadfield's take on possible Illumina announcements at the J.P. Morgan Conference includes an Illumina nanopore device. That's speculation; today we had a pair of tweets from Two Pore Guys previewing their sensing device and saying that they will be talking more at J.P. Morgan (all videos from 2PG). The tweets, both from Two Pore Guys (@TwoPoreGuys) on January 4, 2017: "See the first public demo of our #nanopore device doing a sample-to-result HIV test!" and ".@TwoPoreGuys See us a #JPM17". Video: "2PG Demo Video - HIV" from Two Pore Guys on Vimeo.

The top 10 reasons why blog posts are better than scientific papers

written 14 days ago by Living in an Ivory Basement by Titus Brown

Note: This is the first post in what I hope to be a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future. 1. Blog posts are like preprints, but faster. Even preprints go through some review before they're posted, just to make sure they're not obviously crank papers. Blog posts don't suffer from any prior restraint other than the need to take the time to write them. 2. Blog posts don't end up in PDFs. ...and you don't have to write them in nasty complex formats like Word or LaTeX. Reference: why PDFs suck. 3. Blog posts are like papers, but better written. Blog posts can be colloquial, funny, and sarcastic - unlike scientific papers. Blog posts can also contain narrative in a way that scientific papers simply don't. 4. Blog posts are often opinionated. Papers go through multiple rounds of review and revision, in which the naturally irregular and uneven surface of reality is sanded down and/or bludgeoned into a cuboid that looks and sounds objective and impartial. Blog posts suffer from no such fiction of objectivity and impartiality. (Self-referential case in point.) 5. Blog posts inspire feedback. Perhaps in part because blog posts convey personal opinion, blog posts are inherently more social, more interactive, and more open to commentary. (Presumably this will also be a self-referential case in point. Or not, which would be awesomely ironic!) 6. Blog posts are free, open access, and indexed by search engines. Kind of like preprints, ...

So you want to run an Ensembl workshop

written 15 days ago by Ensembl Blog

We think the Ensembl workshops that we offer are a brilliant way to familiarise yourself, and other people in your research institute, with Ensembl data and tools. Don’t take our word for it: over 99% of the people who attended our […]

University of California Cries "Thief!" on Genia Patents

written 16 days ago by Omics! Omics! by Keith Robinson

As I noted in my last post, the University of California has filed suit against Genia claiming that Genia co-founder Roger Chen misappropriated intellectual property from UC Santa Cruz and the laboratory of Mark Akeson (filings include a bunch of other well-known nanopore scientists, including David Deamer and Dan Branton). While the filings are mostly dry, they are enlivened occasionally by such colorful language as "evasive tactics", "aided and abetted" and "stonewalled". Goaded by Mick Watson, I've dug into the court filings and some of the patents (and obtaining those filings apparently cost me some real money, perhaps approaching $1.0e01 dollars).

Sequencing Technology Outlook, January 2017

written 17 days ago by Omics! Omics! by Keith Robinson

Another year of blogging is upon us! The J.P. Morgan Conference starts a week from today, and then before long it's time for AGBT. So if one is going to prognosticate, there's no time to lose, as announcements could start flying at any time.

reorder boxplot according to median

written 17 days ago by Diving into Genetics and Genomics

reorder factors for boxplot

It is very common that you want to reorder a boxplot according to the medians to see a trend more clearly. I will show you how to do it using the ggplot2 and forcats packages, which were developed by Hadley Wickham. Read his new R for data science book. The examples use the built-in iris data set (columns Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species).

A basic boxplot:

    library(ggplot2)
    ggplot(iris, aes(x = Species, y = Sepal.Width)) + geom_boxplot()

Now, reorder it according to the median levels:

    ggplot(iris, aes(x = reorder(Species, Sepal.Width, FUN = median), y = Sepal.Width)) +
      geom_boxplot()

Or use the fct_reorder function from forcats. It has a similar syntax to reorder, but note that the argument fun is lower case (newer forcats releases spell this argument .fun):

    library(forcats)
    ggplot(iris, aes(x = fct_reorder(Species, Sepal.Width, fun = median), y = Sepal.Width)) +
      geom_boxplot()

You can change the order from high to low:

    ggplot(iris, aes(x = fct_reorder(Species, Sepal.Width, fun = median, .desc = TRUE), y = Sepal.Width)) +
      geom_boxplot()

Some touch-ups:

    ggplot(iris, aes(x = fct_reorder(Species, Sepal.Width, fun = median, .desc = TRUE), y = Sepal.Width)) +
      geom_boxplot(aes(fill = Species)) +
      geom_jitter(position = position_jitter(0.2)) +
      theme_bw(base_size = 14) +
      xlab("Species") + ylab("Sepal width")

If you want to fill the colors manually with colors from RColorBrewer:

    library(RColorBrewer)
    ggplot(iris, aes(x = fct_reorder(Species, Sepal.Width, fun = median, .desc = TRUE), y = Sepal.Width)) +
      geom_boxplot(aes(fill = Species)) +
      scale_fill_manual(values = brewer.pal(3, "Dark2")) +
      geom_jitter(position = position_jitter(0.2)) +
      theme_bw(base_size = 14) +
      xlab("Species") + ylab("Sepal width")

A different theme:

    ggplot(iris, aes(x = fct_reorder(Species, Sepal.Width, fun = median, .desc = TRUE), y = Sepal.Width)) +
      geom_boxplot(aes(fill = Species)) +
      scale_fill_manual(values = brewer.pal(3, "Dark2")) +
      geom_jitter(position = position_jitter(0.2)) +
      theme_classic(base_size = 14) +
      xlab("Species") + ylab("Sepal width")
