Question: Why is Hadoop not used a lot in bio-informatics?
William wrote, 6.0 years ago:

Why is Hadoop not used a lot in bio-informatics? At least in my experience, I don't see Hadoop being used at local research groups or at the well-known and well-funded research groups in the UK and USA, even though Hadoop offers completely distributed IO and CPU power, which should be very attractive for large bio-informatics data analysis.

Is it that the types of files are not suitable for Hadoop? For example, 1000 large binary BAM files of 100 GB each? Can Hadoop work with binary files of that size?

Or is it that the common tools like BWA, Picard, HTS-JDK and GATK can't be run natively on Hadoop?

Mapping 1000 FastQ files to SAM files is something that can be done in parallel for every record in the FastQ files and is, I think, well suited to Hadoop.

But is mapping entire BAM files to gVCF (mapping as in the functional MapReduce paradigm) something that can be done on Hadoop?

And is reducing (reducing as in the functional MapReduce paradigm) the gVCF files to VCF files something that can be done on Hadoop?
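
To make the map and reduce wording above concrete, here is a minimal Python sketch of the data flow only. It does not use Hadoop, BWA, or GATK, and it is not how any of those tools work internally. The names `SAMPLES`, `map_sample`, `reduce_sites`, and `run_pipeline`, as well as the toy records, are made up for illustration.

```python
from collections import defaultdict
from functools import reduce

# Toy stand-in for per-sample alignments: sample -> list of (chrom, pos, base).
# Real BAM records are far richer; this only illustrates the data flow.
SAMPLES = {
    "sampleA": [("chr1", 100, "A"), ("chr1", 100, "G"), ("chr1", 200, "T")],
    "sampleB": [("chr1", 100, "G"), ("chr1", 200, "T"), ("chr1", 200, "T")],
}


def map_sample(sample, records):
    """Map step: one sample in, one per-site summary out (a toy 'gVCF')."""
    sites = defaultdict(lambda: defaultdict(int))
    for chrom, pos, base in records:
        sites[(chrom, pos)][base] += 1
    return {site: {sample: dict(counts)} for site, counts in sites.items()}


def reduce_sites(left, right):
    """Reduce step: merge two per-site summaries (towards a toy joint 'VCF')."""
    merged = {site: dict(per_sample) for site, per_sample in left.items()}
    for site, per_sample in right.items():
        merged.setdefault(site, {}).update(per_sample)
    return merged


def run_pipeline(samples):
    # Each map call is independent, so it could run on a different machine.
    mapped = [map_sample(name, recs) for name, recs in samples.items()]
    # The reduce combines the per-sample outputs into one joint table.
    return reduce(reduce_sites, mapped, {})
```

The map calls never look at each other's input, which is what makes that step easy to distribute. The reduce step is the part that needs all per-sample outputs brought together.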

As you might have gathered, I only have limited knowledge of Hadoop and the Hadoop file system (HDFS), and I am wondering if they are suitable for common genomics data formats and common analysis steps, such as mapping reads with BWA and variant calling with GATK HaplotypeCaller.


I found Hadoop-BAM and SeqPig, but I am wondering if these are just papers / technical proofs of concept or if they also see any real-world use.

hadoop bioinformatics • 17k views

You might be interested in this thread, if you haven't already seen it: Distributed Computing In Bioinformatics

(Reply written 6.0 years ago by dariober; modified 6.0 years ago.)
lh3 (United States) wrote, 6.0 years ago:

Most of the applications you mentioned can be, and already have been, implemented on top of Hadoop. A good example is the ADAM format, a Hadoop-friendly replacement for BAM, and its associated tools. They are under active development by professional programmers. Nonetheless, I see a few obstacles to wider adoption:

  1. It is harder to find a local Hadoop cluster. My impression is that Hadoop really shines in large-scale cloud computing, where we have a huge (virtual) pool of resources and can respond to users on demand. In a multi-user environment with limited resources, I don't know if a local Hadoop cluster is as good as LSF/SGE at balancing resources fairly across users.
  2. We can use AWS, Google Cloud, etc., but we have to pay. Some research labs may find this unfamiliar. Those who have free access to institution-wide resources would be even more reluctant.
  3. Some pipelines are able to call variants from 1 billion raw reads in 24 hours with multiple CPU cores. This is already good enough in comparison to the time and cost spent on sequencing, so there is no huge need for better technologies. In addition, although Hadoop frequently saves wall-clock time due to its scalability, at times it wastes CPU cycles on its extra layer. In a production setting, the total CPU time across many jobs matters more than the wall-clock time of a single job. Some argue that the compute-close-to-data model of Hadoop is better, but for many analyses we only read through the data once. The data transferred over the network is then the same as the data dispatched in the Hadoop model.
  4. Improvements to algorithms frequently have a much bigger impact on data processing than using a better technology. For example, there is a Hadoop version of MarkDuplicates that takes much less wall-clock time (more CPU time, though) than Picard. However, recent streamed algorithms, such as SamBlaster and the new Picard, can do this faster in terms of both CPU and wall-clock time. As another example, there was a concern about the technical difficulty of multi-sample variant calling, so someone developed a Hadoop-based caller. By the time it came out, GATK had moved to gVCF, which solves the problem in a much better way, at least for deep sequencing. Personally, I would rather improve algorithms than adapt my working tools to Hadoop.
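
The distinction in point 3 between wall-clock time of one job and total CPU time across many jobs can be put into numbers. The sketch below is a deliberately crude model: it assumes perfect scaling across cores and a fixed fractional overhead for the distribution layer. All figures (96 core-hours, 16 vs 64 cores, 25% overhead, 100 jobs) are invented for illustration and are not measurements of Hadoop or of any pipeline.

```python
def job_cost(single_core_hours, cores, overhead_fraction):
    """Return (wall_clock_hours, cpu_hours) for one job.

    Assumes perfect scaling across cores plus a fixed fractional overhead
    added by the distribution layer. Both are simplifying assumptions.
    """
    cpu_hours = single_core_hours * (1.0 + overhead_fraction)
    wall_clock_hours = cpu_hours / cores
    return wall_clock_hours, cpu_hours


def batch_hours(cpu_hours_per_job, jobs, cluster_cores):
    """Hours needed to push a whole batch through a fully busy cluster."""
    return cpu_hours_per_job * jobs / cluster_cores


# Hypothetical numbers, chosen only to show the shape of the trade-off.
local_wall, local_cpu = job_cost(single_core_hours=96, cores=16, overhead_fraction=0.0)
dist_wall, dist_cpu = job_cost(single_core_hours=96, cores=64, overhead_fraction=0.25)

# One job finishes sooner when spread over more cores...
print(f"single job: {local_wall} h locally vs {dist_wall} h distributed")
# ...but 100 such jobs on the same 64 cores take longer with the overhead.
print(f"100 jobs:   {batch_hours(local_cpu, 100, 64)} h vs {batch_hours(dist_cpu, 100, 64)} h")
```

Under these made-up numbers, the distributed run wins for a single job but loses once the cluster is saturated with many jobs, which is the production situation described in point 3.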

For some large on-demand services, Hadoop from massive cloud computing providers is hugely advantageous over the traditional computing model. Hadoop may also do a better job for certain bioinformatics tasks (gVCF merging and de novo assembly come to mind). However, for the majority of analyses, Hadoop only adds complexity and may even hurt performance.

Jeremy Leipzig (Philadelphia, PA) wrote, 6.0 years ago:

You are correct in noting that most Hadoop-for-bioinformatics papers are proofs of concept and that real-world use of Hadoop in bioinformatics is quite low.

Hadoop combines two awesome bottlenecks to bring bioinformatics software to its knees - using the network to disperse data and then relying on disk IO to access it (often from the same networked drive).

There are some bioinformatics applications that may benefit from MapReduce but those tend to closely resemble the type of e-commerce problems Hadoop was designed to solve. In most use cases I suspect threaded programs designed for big ass servers would perform better than their Hadoop counterparts.

I am interested to see how the Spark/Avro/Parquet stack performs as it relies much more on RAM, and hence BAS boxes.

dw314159 wrote, 6.0 years ago:

I used Hadoop for a bioinformatics analysis of mRNA complexity. The analysis and results are described at [link]. Source code is provided.

Daniel Swan (Aberdeen, UK) wrote, 6.0 years ago:

This blog post from Abishek Tiwari is a couple of years old, but it clearly shows that there are a number of applications out there using this kind of methodology.

I imagine there are plenty more in the last couple of years.


Thanks. It would be nice to know if these papers are just papers or if the described tools also see any real-world use (except GATK, which uses a MapReduce engine but does not run on Hadoop). It's the (lack of) real-world use that I am interested in, not just the technical proofs of concept.

(Reply written 6.0 years ago by William; modified 6.0 years ago.)
William wrote, 6.0 years ago:

I did some more reading. The most promising development in the genomics distributed computing world does indeed (as lh3 mentioned) look to be ADAM and the related formats and toolkits:

A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.

Current genomic file formats are not designed for distributed processing. ADAM addresses this by explicitly defining data formats as Apache Avro objects and storing them in Parquet files. Apache Spark is used as the cluster execution system.

Once you convert your BAM file to ADAM, it can be directly accessed by Hadoop Map-Reduce, Spark, Shark, Impala, Pig, Hive, whatever. Using ADAM will unlock your genomic data and make it available to a broader range of systems.

At the moment, we are working on three projects:

  • ADAM: A scalable API & CLI for genome processing
  • bdg-formats: Schemas for genomic data
  • avocado: A Variant Caller, Distributed

I don't (yet) know if they support the full feature set of BWA-Picard-GATK with production quality, but it sure looks interesting.


Wikipedia tells me Spark is up to 100x faster than Hadoop MapReduce, which begs the question: What was holding up Hadoop so much?

(Reply written 6.0 years ago by Jeremy Leipzig.)

I guess Hadoop is slow mainly because it uses disks too much. Spark is largely an improved, RAM-oriented implementation of Hadoop concepts. For the 100x speed-up, the wiki links to a paper about Shark, which is a Spark-based SQL engine. For database queries, in-memory access will of course be far faster than disk access. For other applications, the speed-up may be marginal.

(Reply written 6.0 years ago by lh3; modified 6.0 years ago.)

You're right that Spark is faster in some use cases because it keeps results in RAM, but it's important to add that this is faster for iterative algorithms where the data is going to be used again and again. In these cases Spark reduces run time by reducing the need to go all the way to the hard disk and back. Hadoop is still fast (faster?) when data only needs to be written back to disk once. Spark is expected to become more important than Hadoop over time.
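
The point about iterative algorithms can be illustrated with a toy model that simply counts how often the input is fetched from storage. This is plain Python, not Spark or Hadoop code. `CountingStore`, `iterate_without_cache`, and `iterate_with_cache` are hypothetical names, and the "algorithm" is just a repeated sum standing in for any multi-pass computation.

```python
class CountingStore:
    """Stand-in for slow storage: counts how many times the data is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def iterate_without_cache(store, passes):
    """Re-read the input from storage on every pass."""
    total = 0
    for _ in range(passes):
        total += sum(store.load())
    return total


def iterate_with_cache(store, passes):
    """Read once, keep the data in memory, and reuse it on every pass."""
    cached = store.load()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total


disk_bound = CountingStore([1, 2, 3])
in_memory = CountingStore([1, 2, 3])
iterate_without_cache(disk_bound, passes=10)
iterate_with_cache(in_memory, passes=10)
print(disk_bound.reads, in_memory.reads)
```

With a single pass both variants read the data exactly once, so keeping it in memory buys nothing. The gap only opens up as the number of passes grows.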

(Reply written 4.5 years ago by maxwell.pietsch; modified 4.5 years ago.)

With Hadoop 2.0, which uses YARN rather than the MapReduce engine as its resource manager (making it suitable for streaming applications), bioinformatics should be just the field for it.

(Reply written 3.4 years ago by plabanbiswas9610.)
u1058969 wrote, 5.1 years ago:


I'm currently using an R and Hadoop environment for bioinformatics research.

In my experience, this is feasible if you have knowledge of Linux, Java, R, Hadoop, and biology, and it costs nothing because all of these are open-source software.
You could even develop your own packages to optimise the framework for R or Hadoop.


