Question

De-Novo Assembly pipeline on hadoop

1

9.5 years ago

chefarov ▴ 170

Hello all,

I am planning to develop a de novo assembly tool, similar to Trinity, that makes use of the Hadoop framework.

To do so, I would use a Hadoop de novo assembler (CloudBrush or Contrail) and add the analysis steps (gene expression, gene function, etc.), so that we have an automated "pipeline" tool for running analysis scenarios.

This is going to be my thesis project, so I wanted to ask if anyone knows of similar work that has already been done.

After searching for several hours I can't find anything identical. The most similar concept seems to be Galaxy-Hadoop integration, which is a different thing as far as I understand, since you need to write your own Hadoop tools and then wrap them for Galaxy.

Am I missing something?

Thanks in advance for your time,

Stelios

next-gen rna-seq hadoop Assembly dna-seq • 3.2k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by chefarov ▴ 170

1

EDIT: I was wrong, sorry OP!

Original off-point post below:

What advantage does hadoop offer over the already easily parallelized Trinity (using a cluster's batch system), or over the MPI/C++ implementation that IBM is helping the Trinity team develop?

How do you plan to handle the places where knowledge of global state or data is needed? Using Trinity as an example, all of Inchworm and parts of Chrysalis can only run on a single (large-memory) node. Parallelizing these steps would require communication between nodes, assuming there is a parallel algorithm that allows a sufficient speedup at all, so they are not ideal for Hadoop.

Parallelization of trinity is probably a very tough task and might not be worthwhile to explore, which may be one of the reasons why the ideal hardware is a single very large memory machine. It may be the case that these parts of trinity are not only difficult to parallelize, but may not benefit significantly (if at all) from parallelization.

You can imagine a case where one node has several reads that it doesn't need but one or more of the other nodes do. Since it isn't possible to know which node needs these reads, you'd have to broadcast them to all other nodes. So each node churns through its reads and then has to broadcast the remaining ones to all other nodes, and this happens for every node. This means there will be tons of processes not doing anything. You would also have issues with uneven distribution of reads: one node may end up keeping only a few, while another may need more reads than it has memory for.
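To put rough numbers on the broadcast scenario above, here is a minimal back-of-the-envelope sketch in Python. It is not taken from Trinity or from any Hadoop tool; the function names and the per-node read counts are hypothetical and only illustrate how the transfer volume and the load imbalance could be estimated.

```python
def broadcast_transfers(leftover_reads_per_node):
    """Count read transfers if every node sends each read it does not
    need to all other nodes (the naive broadcast described above)."""
    n_nodes = len(leftover_reads_per_node)
    return sum(leftover * (n_nodes - 1) for leftover in leftover_reads_per_node)


def load_imbalance(reads_kept_per_node):
    """Ratio of the busiest node's read count to the mean read count.
    1.0 means perfectly even; larger values mean more idle waiting."""
    mean = sum(reads_kept_per_node) / len(reads_kept_per_node)
    return max(reads_kept_per_node) / mean


if __name__ == "__main__":
    # Hypothetical numbers: 4 nodes, each left with 100 reads it cannot use.
    print(broadcast_transfers([100, 100, 100, 100]))
    # Hypothetical skew: one node ends up keeping almost everything.
    print(load_imbalance([1, 1, 1, 37]))
```

The transfer count grows with both the number of leftover reads and the number of nodes, which is the cost being described; the imbalance ratio is one simple way to see how many nodes would sit idle.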

This is before you get to any of the joys of trying to parallelize graph problems.

You may want to ask yourself a different question: Why aren't there any Hadoop-based RNA-Seq assembly tools? Is it because no one has thought of it, or is it because there are reasons why it isn't ideal or didn't work? I know Hadoop is in vogue right now, but what does it really bring to the table?

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by pld 5.1k

2

Hi Joe,

Thanks for your reply.

First of all, I am not trying to parallelize Trinity for Hadoop. I am planning to use an already developed Hadoop DNA-seq assembler (CloudBrush) and take its FASTA output as the input to my post-processing. By 'similar to Trinity' I meant the features and output, not the methodology of the assembly part, since I don't intend to write an assembler.
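As a rough illustration of what a first post-processing step on the assembler's FASTA output could look like, here is a minimal Python sketch. It makes no assumptions about CloudBrush itself beyond the output being standard FASTA; the function names `read_fasta` and `n50` and the example contigs are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the record name.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Smallest contig length such that contigs at least that long
    cover at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical contigs standing in for an assembler's FASTA output.
    example = ">contig1 example\nACGTAC\nGT\n>contig2\nGGCC\n"
    lengths = [len(seq) for _, seq in read_fasta(io.StringIO(example))]
    print(lengths, n50(lengths))
```

Expression and functional analysis would sit downstream of a step like this; the sketch only shows where the assembler's output enters the pipeline.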

I don't argue that load balancing of the read processing would be better in Hadoop; besides, I don't have the necessary experience in distributed computing and parallelization techniques to judge that. So I will consider what you said very carefully.

Since they chose to adapt Trinity to these grid architectures and MPI, those are probably the better options, but I suppose there could be cases where one only has access to a (local) Hadoop cluster, even if that's not the ideal scenario.

To tell the truth, what I saw was that CloudBrush's scalability seemed good, meaning that the execution time drops linearly as we increase the number of nodes (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3521391/pdf/1471-2164-13-S7-S28.pdf page 13).
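One way to check a scalability claim like this against measured runtimes is to compute speedup and parallel efficiency relative to the smallest run. The Python sketch below uses made-up runtimes, not figures from the CloudBrush paper, and the function names are hypothetical.

```python
def speedup(t_base, t_n):
    """How many times faster the larger run is than the baseline run."""
    return t_base / t_n


def efficiency(t_base, n_base, t_n, n):
    """Parallel efficiency relative to a baseline of n_base nodes.
    1.0 means ideal scaling: k times the nodes gives 1/k of the runtime."""
    return (t_base * n_base) / (t_n * n)


if __name__ == "__main__":
    # Hypothetical runtimes in minutes: 10 nodes -> 100, 20 nodes -> 50 or 80.
    print(speedup(100, 50), efficiency(100, 10, 50, 20))
    print(speedup(100, 80), efficiency(100, 10, 80, 20))
```

An efficiency that stays close to 1.0 as nodes are added is what good scaling looks like; a value that falls off quickly suggests communication or load imbalance is starting to dominate.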

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by chefarov ▴ 170

1

Sorry about that, I missed your point!

I guess then it isn't as clear to me why Hadoop hasn't been used in the capacity you're describing. All I can think of is a cultural penchant for managing all of this with perl/bash/make/etc., and maybe a preference for flexibility or a desire to avoid investing in platforms.

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by pld 5.1k

1

No problem. Your comment was (in another way) useful for me :)

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by chefarov ▴ 170