Exploring the 1000 genome dataset with Hail on Amazon EMR and Amazon Athena

written 2 days ago by Kevin's GATTACA World

Blog post from Roy Hasson. Genomic analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse Amazon EMR. For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.

What’s coming in Ensembl 92 and Ensembl Genomes 39

written 3 days ago by Ensembl Blog

Both Ensembl release 92 and Ensembl Genomes release 39 are scheduled for April 2018. Included are new genomes and genebuilds (Goat, Zebrafish, Marmoset, Stalk-eyed fly and Waterflea) and a command line version of our new Linkage Disequilibrium tool. Here are the highlights you can look forward to: Ensembl 92 New assemblies, gene sets and annotations […]

AGBT: It Ain't Over 'til the Tattoo Wears Off

written 5 days ago by Omics! Omics! by Keith Robinson

AGBT officially ended on Thursday night with a space-themed party, but I have a bunch of notes from interviews with company representatives and even a few notes from sessions. So be prepared for a string of further AGBT reports. This dispatch will have some overall thoughts as well as some notes on the possible return of AGBT to Marco Island next year. I also want to mention two good AGBT 2018 summaries, one from Dale Yuzuki and another from Decibio's Stephane Budel. Read more »

GDE² = DGE² + DTU² = DTE₁² + DTE₂²

written 9 days ago by Bits of DNA by Lior Pachter

The development of microarray technology two decades ago heralded genome-wide comparative studies of gene expression, but it was the widespread adoption of RNA-Seq that has led to differential expression analysis becoming a staple of molecular biology studies. RNA-Seq provides measurements of transcript abundance, making possible not only gene-level analyses, but also differential analysis of isoforms of […]

Do software and data products advance biology more than papers?

written 9 days ago by Living in an Ivory Basement by Titus Brown

Software and data good.

Waterman’s egg

written 9 days ago by Bits of DNA by Lior Pachter

I recently published a paper on the bioRxiv together with Vasilis Ntranos, Lynn Yi and Páll Melsted on Identification of transcriptional signatures for cell types from single-cell RNA-Seq. The contributions of the paper can be summed up as: The simple technique of logistic regression, by taking advantage of the large number of cells assayed in […]

AGBT: BioNano Launches New Labeling Approach

written 10 days ago by Omics! Omics! by Keith Robinson

As AGBT opened, optical mapping company BioNano Genomics announced a new scheme for labeling genomic DNA inputs which substantially improves performance. Sven Bocklandt from the company sat down with me yesterday to walk through the new Direct Labeling […] Read more »

AGBT: Twist Biosciences Launches Sequence Capture Product

written 10 days ago by Omics! Omics! by Keith Robinson

Twist Biosciences today launched a new product into the sequence capture space. CEO Emily Leproust was presenting to the Gold Sponsor workshop as I started writing this, but she also sat down with me yesterday to preview the new offering for targeted sequencing. Read more »

The End of MassGenomics

written 11 days ago by MassGenomics by Dan Koboldt

I started MassGenomics ten years ago, when so-called next-generation sequencing was still in its infancy. I’d joined the Genome Sequencing Center at Washington University, fulfilling a dream I had since high school. At the time, two NGS technologies had begun to emerge: 454 pyrosequencing and Solexa sequencing-by-synthesis. Over the next several years, Solexa was acquired […]

AGBT: 10X Previews Three New Single Cell Applications

written 11 days ago by Omics! Omics! by Keith Robinson

I spent breakfast with 10X Genomics' Michael Schnall-Levin and two of his 10X colleagues, who gave me a sneak peek at three new single cell products they are rolling out at the workshop I'm typing away at now. These enable measuring protein targets of antibodies, mapping out accessible chromatin regions with ATAC-Seq, and mapping copy number variants (CNVs) at single cell resolution. All use the existing Chromium Controller instrument. Read more »

AGBT 2018: It's Great to Be Back

written 12 days ago by Omics! Omics! by Keith Robinson

All sorts of scheduling snafus have kept me away the past three years. So this time around, I vowed to go and made sure my calendar stayed clear. So clear, I forgot to put a reminder down to actually register for the event. Luckily, there were slots still available when I put my flier in. Read more »

AGBT Swag Bag

written 12 days ago by Omics! Omics! by Keith Robinson

Today at AGBT is light on the science talks; the afternoon is free for lazing around the resort complex -- or for swimming laps in the lazy river (which makes it a not-so-lazy-river). I can only manage downstream; upstream is an aquatic treadmill. A key task on Day 1 is to pick up one's registration materials. At one conference I failed to do this promptly and discovered to my dismay that the desk wasn't open during the opening reception slash poster session -- so despite being a speaker I had to sneak into the room via a side door! Registering means picking one's meal pass -- I took the temporary tattoo over the wristband option -- and grabbing the vaunted AGBT backpack. Read more »

Brown Webcast Note: Corrections and Expansions

written 13 days ago by Omics! Omics! by Keith Robinson

After I post something, there's almost always something I realize I left out. In my piece on Clive Brown's webcast of ONT improvements, not only did I forget a few key details but my wording led to some unfortunate confusion, as judged by a comment. Someone took me up on my idea on how detecting large fragments during a run might work -- and showed it doesn't pan out (which Clive Brown confirmed). And to top things off, a BioRxiv preprint showed up that exactly covered something I alluded to. Read more »

The curse of large numbers (Big Data considered harmful)

written 14 days ago by The Grand Locus

According to the legend, King Midas got the sympathy of the Greek god Dionysus who offered to grant him a wish. Midas asked that everything he touched would turn into gold. At first very happy with his choice, he realized that he had brought on himself a curse, as his food turned into gold before he could eat it. This legend on the theme “be careful what you wish for” is a cautionary tale about using powers you do not understand. The only “powers” humans ever acquired were technologies, so one can think of this legend as a warning against modernization and against the fact that some things we take for granted will be lost in our desire for better lives. In data analysis and in bioinformatics, modernization sounds like “Big Data”. And indeed, Big Data is everything we asked for. No more expensive underpowered studies! No more biased small samples! No more invalid approximations! No more p-hacking! Data is good, and more data is better. If we have too much data, we can always throw it away. So what can possibly go wrong with Big Data? Enter the Big Data world and everything you touch turns... Read more on the blog: The curse of large numbers (Big Data considered harmful)

Ensembl Plants Project Leader

written 15 days ago by Ensembl Blog

We’re looking for a bioinformatician to lead our Ensembl Plants team, working with collaborators to import, analyse and integrate plant genomic data. We’re looking for five years or more experience in bioinformatics, preferably plant genomics, using NGS data. Closes 8th March. Location: EMBL-EBI Hinxton near Cambridge, UK Staff Category: Staff Member Contract Duration: 3 years […]

Getting to know us – Carla from Compara

written 15 days ago by Ensembl Blog

This month we’re meeting Carla from our comparative genomics team (which we call compara). What is your job in Ensembl? I am a developer in the comparative genomics team. Our job is to compute any resource in Ensembl that involves comparing species. These include whole genome alignments, gene trees and homology predictions. What do you […]

February 2018 Clive Brown Webcast Notes

written 15 days ago by Omics! Omics! by Keith Robinson

Clive Brown's webcasts are always entertaining, and even the 6am Eastern Time start for Thursday's didn't hinder that aspect -- though I am thankful I'm not on the U.S. West Coast because I really don't function at 4am. Even at 6am, I was frequently shutting off my iPad screen or exiting the presentation, as screenshots on iOS involve simultaneously pressing Power and Home keys. At that hour, my never great fine motor skills just aren't reliable. Hopefully I won't make the dog's breakfast of this, as that's usually all I'm good for processing at that hour! Still, lots of updates and promises as well as a number of "wait until London Calling" teasers. Just to get this out of the way, I'm going to report the launch dates that Oxford mentioned -- anyone in this space should know that Oxford is very good at delivering what they promise, but not very good at delivering when they promise. You can also find notes by David Eccles to check me against or watch the presentation recording from ONT. Read more »

Bioinformatician – GENCODE

written 16 days ago by Ensembl Blog

We’re looking for a bioinformatician to work on comparing and building a consensus between our GENCODE genes with RefSeq. We’re looking for degrees in genetics or biological science with an understanding of genomics and experience in Perl and relational databases. Closes 4th March. Location: EMBL-EBI Hinxton near Cambridge, UK Staff Category: Staff member Contract Duration: […]

Oxford Nanopore Outlook 2018

written 17 days ago by Omics! Omics! by Keith Robinson

I'm behind on these posts. My usual foibles were largely responsible for a while, but then I had the major (and sad) family issue that has kept me off balance for two weeks. Someday I may write about that, but for now back to the major sequencing vendors. Though with Oxford Nanopore, the problem is where to start? But now is the time to get moving, both since Oxford's Clive Brown will be webcasting an update on Thursday and I'll be at AGBT next week and expect to be busy with news flow from that event. Clive's webcast is titled "sub-$1000 human genomes on Nanopore (and other goodies for H1 2018)", so expect quite a casserole of tempting updates. Certainly it is enough to get me to try to be fully mentally awake at 6 am, something that does not come naturally. Read more »

UniProt and the Expanding Tree of Life

written 18 days ago by Inside UniProt

UniProt loves life in all its forms, but we especially love its complement of proteins. We want to bring you the protein sequences from the massive diversity of organisms across the whole planet. We have been closely following how the Tree of Life is expanding and being increasingly accurately resolved. Here's a look at a couple of the most exciting discoveries and how they are reshaping what we do. Below is a revised Tree of Life presented by Laura Hug et al., which is based upon an alignment of 16 ribosomal proteins. Figure 1. The revised Tree of Life from Hug et al. 2016. Lineages lacking an isolated representative are highlighted with non-italicized names and red dots. We can see that the large majority of organisms are microbial, and as yet we are unable to grow a large fraction of them in the lab. Red dots in the figure show phyla for which not even a single organism has been cultured. However, due to the power of next generation sequencing and improving metagenomic assembly and binning tools, we now have access to thousands of complete or near complete genomes assembled from metagenomic data (Anantharaman et al. 2016, Parks et al. 2017). These genomes have been called MAGs for metagenomic assembled genomes. Probably the most exciting MAG to have been assembled is that of an enigmatic archaebacterium that lives in deep sea sediments. Lokiarchaebacterium is named after the location at which it was first identified (Spang et al. 2015); the Loki’s Castle ...

convert a human gmt file to mouse for GSEA

written 18 days ago by Diving into Genetics and Genomics

Blog downtime, Weds 7th Feb

written 19 days ago by Ensembl Blog

This blog will be offline for a short while for maintenance on the morning of Wednesday 7th February. All other services will be unaffected. We will also be implementing a new template and reorganising some content, so your bookmarks may […]

Ensembl FTP and mirror websites downtime, Tue 6th Feb

written 19 days ago by Ensembl Blog

The Ensembl FTP site will be unavailable on 06/02/18 between 0900 and 1400 GMT (UTC) for hardware upgrades. The Ensembl mirror websites will also be unavailable during this period. All other services will be unaffected.

Fingerprints on Jupiter

written 25 days ago by Omics! Omics! by Keith Robinson

I had hoped to mark my father's 93rd birthday today in my usual way, a call home to exchange well wishes and update him on our goings-on. But two weeks ago he entered the hospital for what turned out to be a final visit, so instead I am writing this. Read more »

Bioinformatics on a Rock64

written 26 days ago by Bits of DNA by Lior Pachter

I have been fascinated with mini computers for some time, and have wondered when they will become suitable for bioinformatics. The 4273π project, which is an online course that is distributed as a 32 GB SD card image for the Raspberry Pi, has been around for a few years and demonstrated the utility of mini computers for […]
