Forum: Arvados vs Big Data Genomics
5.7 years ago, Jeremy Leipzig (19k), Philadelphia, PA, wrote:

Has anyone tried one or both of these? Which is further along in terms of storage and retrieval of sequencing formats, especially variants?

arvados forum adam bdg • 3.4k views
modified 5.4 years ago by tth (20) • written 5.7 years ago by Jeremy Leipzig (19k)

Any fresh input on this topic? Has anyone adopted either technology?

written 3.5 years ago by podro (0)
5.4 years ago, tth (20) wrote:

Hi Jeremy,

I am currently looking into both technologies, and they differ in what they provide.
Arvados is essentially a platform for data sharing and, more importantly, for tracking the provenance of derived or analyzed data. This is especially appealing if you work in a global organisation where data is distributed across several sites and de-duplication becomes an important topic. This is where Arvados seems to be a very valuable platform. Moreover, any pipeline that currently runs in an ordinary shell environment can be ported to Arvados much faster than to, e.g., a Cloudera platform.

They make heavy use of Docker, which basically means container-based virtualisation at the application level. They implement their own MapReduce stack, and that is where it gets tricky for me. I would like to use Spark and ADAM on, e.g., a Cloudera platform, while keeping the ingenious data provenance and de-duplication features that Arvados provides, combined with the large and innovative Cloudera community.

If someone in this forum can help with this question, it would be great to get some more insight. I think using ADAM would be feasible, but it is unclear to me whether Spark could easily be run on top of the data stored in Arvados. I am still learning and reading. If I find an answer, I will post it here.

Before you ask, I do not contribute to Arvados ;)


Hello, I'm on the Arvados team. Thank you for your insightful comment.

Regarding ADAM/Arvados integration, I can't share any specific plans right now, but this is something we are very interested in and hope to work on in the future.

We are also heavily involved in the Common Workflow Language (CWL) effort to standardize how tools and workflows are described so that they are portable across different platforms. This will make the distinctions between underlying cluster technologies such as Spark, YARN, and Arvados Crunch less relevant for day-to-day bioinformatics work that doesn't go deep into the infrastructure layer.

modified 5.4 years ago • written 5.4 years ago by peter.amstutz (300)
5.7 years ago, WilliamS (320) wrote:

I am trying ADAM / Spark / Big Data Genomics for storage, retrieval, and analysis of 1000 Genomes VCF data.

They are hoping to have a production release by the end of this year.

Arvados looks like they are building everything from scratch, while ADAM is building on general-purpose big data infrastructure such as Spark, HDFS, Parquet, and YARN. My bet would be on ADAM, also because Berkeley AMPLab, the Broad Institute, and Mount Sinai are involved in the development.


Hello! It depends on what you're looking for with respect to variant storage and retrieval, but I would like to note that there is already a free hosted version of Arvados that people are welcome to evaluate (you can use any Google account to log in).

Disclaimer: I contribute to Arvados. We would definitely appreciate any feedback you have! :)

modified 5.6 years ago • written 5.6 years ago by Nancy Ouyang (170)