Question: Distributed Computing In Bioinformatics
Ngsnewbie wrote, 7.6 years ago:

As of now we have some Hadoop-based packages (Crossbow, CloudBurst, etc.) for NGS data analysis, yet I find that people still prefer tools like Bowtie, TopHat, and SOAP in their work. I am a biologist, but I would still like some ideas on whether it is possible to convert serial tools into map-reduce form, to exploit scalable distributed computing with Hadoop and expedite research. Also, what are the challenges in adapting mapping and assembly algorithms to run on a Hadoop system?

I am also curious to know about other bioinformatics tasks that could be done with Hadoop-based projects like Hive, Pig, and HBase, which deal with big data such as FASTQ files, SAM files, count data, or other forms of biological data.
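To make the question concrete, here is a minimal sketch of how a record-wise FASTQ task fits the mapper/reducer shape that something like Hadoop Streaming expects. The function names and the read-length statistic are purely illustrative, not taken from any existing tool:

```python
# Hypothetical Hadoop Streaming-style job: the mapper emits
# (read_length, 1) for each FASTQ record and the reducer sums the
# counts per length. Hadoop Streaming would run these over
# stdin/stdout across many nodes, but the same functions also run
# serially, which is the point: any record-wise computation can be
# recast this way.
import sys
from itertools import groupby

def map_fastq(lines):
    """Emit (read_length, 1) for every 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the sequence line of each record
            yield len(line.strip()), 1

def reduce_counts(pairs):
    """Sum counts per key; assumes input grouped/sorted by key,
    which Hadoop guarantees between the map and reduce phases."""
    for length, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield length, sum(c for _, c in group)

if __name__ == "__main__":
    for length, n in reduce_counts(map_fastq(sys.stdin)):
        print(f"{length}\t{n}")
```

The catch, as the answers below discuss, is that alignment and assembly are not this trivially record-wise: the reduce side (or the shared reference/graph state) is where serial tools resist the conversion.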

ngs • 7.3k views
modified 7.6 years ago by Jeremy Leipzig • written 7.6 years ago by Ngsnewbie

Please explain why you specifically want to use Hadoop. You can always parallelize your analysis without a map/reduce process, a cloud, etc.

written 7.6 years ago by Pierre Lindenbaum

Actually, I am just exploring Hadoop technology, so I am seeking to understand the challenges and the impact of Hadoop in NGS/bioinformatics data analysis. I don't specifically want to use Hadoop, but if I try it, will it be fruitful, and what hurdles would there be?

written 7.6 years ago by Ngsnewbie
Roman Valls Guimerà wrote, 7.6 years ago:

Well, if you want to explore it, looking at the current bio*-hadoop ecosystem and related fora is a good place to start:

There you can find tools like Seal and Hadoop-BAM, which target the last part of your question more specifically.

Furthermore, albeit a bit old, the following video & slides still hold up as a general view of Hadoop in bioinformatics:

Last but not least, a couple of my favourite blogs about Hadoop and big data in the biosciences (although not limited to them) are Follow the Data and mypopescu:

Hope that helps!

written 7.6 years ago by Roman Valls Guimerà
lh3 (United States) wrote, 7.6 years ago:

Except for de novo assembly, the bottleneck of NGS analyses is frequently read mapping and SNP calling. For these analyses, you can trivially split your read files for mapping, and chromosomal regions for calling, and run the jobs separately on different compute nodes. Hadoop adds little benefit in this case while requiring a special setup which might (I am not sure) interfere with other, non-Hadoop jobs. I also see the small number of researchers who understand how Hadoop works as a big obstacle.
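The split-and-scatter approach described above can be sketched without any Hadoop machinery. In this sketch the file names and the bwa command line are illustrative placeholders, not a prescribed pipeline; any scheduler (SGE, SLURM, ...) could run the emitted commands on separate nodes:

```python
# Scatter step: shard a FASTQ (given as an iterable of lines) into
# chunks of N reads, then emit one independent mapping command per
# chunk. Each command can run on its own node with no coordination,
# which is why Hadoop adds little here.

def chunk_fastq(lines, reads_per_chunk):
    """Yield lists of FASTQ lines, each holding reads_per_chunk reads
    (4 lines per read); the final chunk may be smaller."""
    chunk, reads = [], 0
    for line in lines:
        chunk.append(line)
        if len(chunk) % 4 == 0:       # a full record just completed
            reads += 1
            if reads == reads_per_chunk:
                yield chunk
                chunk, reads = [], 0
    if chunk:
        yield chunk

def scatter_commands(lines, reads_per_chunk):
    """Return one shell command per chunk to map it independently."""
    cmds = []
    for i, chunk in enumerate(chunk_fastq(lines, reads_per_chunk)):
        fq = f"chunk_{i:03d}.fq"      # placeholder file name
        # in a real run, write the chunk out first:
        # with open(fq, "w") as f: f.writelines(l + "\n" for l in chunk)
        cmds.append(f"bwa mem ref.fa {fq} > chunk_{i:03d}.sam")
    return cmds
```

The gather step is then a plain merge of the per-chunk SAM outputs, which is again an embarrassingly parallel pattern rather than a map-reduce one.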

On the other hand, these concerns about Hadoop are relatively minor technically. If you can move the most widely used bwa-Picard-GATK pipeline to Hadoop, there will be some potential users, especially among those who rely on Amazon. Crossbow and CloudBurst are not so popular partly because they do not implement the best pipeline. Scientists usually choose accuracy over speed/convenience unless the difference in accuracy is negligible while the difference in speed is over a couple of orders of magnitude.

modified 12 months ago by RamRS • written 7.6 years ago by lh3
Jeremy Leipzig (Philadelphia, PA) wrote, 7.6 years ago:

One of the more compelling uses of Hadoop would be querying variants from thousands of individuals, as illustrated with SeqWare here:

[figure: SeqWare variant-query illustration]

Two caveats stand out:

  1. Equivalent BASS hardware (like this 32TB monster from Oracle) will still outperform distributed setups.

  2. In the example above, couldn't they have simply divided the individuals or variants among 6 machines running BerkeleyDB, without being overly clever?

written 7.6 years ago by Jeremy Leipzig

I remember this paper. I think it falls into a typical trap for technical people: trying to put everything, such as SAM, VCF, BED, and WIG, into a generic database, and adding hardware when that does not work. This approach rarely gives satisfactory results in the NGS era. For huge amounts of data, we need specialized treatments and occasionally advances in methodology. Such approaches can be orders of magnitude more efficient than a generic database. We had some interaction with a few top Google engineers. When we chatted about storing many BAMs/VCFs, their reaction was first to design a specialized binary representation, not to put each record into BigQuery or a similar existing system. That is the right direction.

modified 7.6 years ago • written 7.6 years ago by lh3
Powered by Biostar version 2.3.0