I am currently looking into both technologies, and they differ in what they provide.
Arvados is basically a platform for data sharing and, more importantly, for provenance of derived or analyzed data. This is especially appealing if you work in a global organisation where data is distributed across several sites and de-duplication becomes an important topic. This is where Arvados seems to be a very valuable platform. In addition, any pipeline that currently runs in an ordinary shell environment can be ported to Arvados much faster than to, e.g., a Cloudera platform.
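To make the de-duplication point a bit more concrete: as far as I understand it, Arvados' storage layer (Keep) addresses data by a hash of its content, so identical data ends up under the same identifier. The following is only a toy sketch of that idea in Python, not Arvados code. The class `ContentStore`, its methods, and the choice of SHA-256 are my own assumptions for illustration.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, not the Arvados API)."""

    def __init__(self):
        # Maps content digest -> stored bytes.
        self._blocks = {}

    def put(self, data: bytes) -> str:
        """Store data under the hash of its content and return that hash."""
        digest = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same key, so it is kept only once.
        self._blocks.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        """Fetch data by its content hash."""
        return self._blocks[digest]

    def block_count(self) -> int:
        """Number of distinct blocks actually stored."""
        return len(self._blocks)


store = ContentStore()
# Two sites upload the same record; a third record differs.
site_a = store.put(b"chr1\t12345\tA\tG\n")
site_b = store.put(b"chr1\t12345\tA\tG\n")
other = store.put(b"chr2\t67890\tC\tT\n")
```

Here `site_a` and `site_b` are the same identifier and the store holds only two blocks. Because the identifier is derived from the content, it can also be recorded as the exact input of an analysis run, which is the basic idea behind provenance tracking.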
They make heavy use of Docker, which basically means virtualisation at the application level. They implement their own MapReduce stack, and this is where it gets tricky for me. I would like to use Spark and ADAM on, e.g., a Cloudera platform while keeping the ingenious data provenance and de-duplication features that Arvados provides, combined with the big and innovative community around Cloudera.
If someone in this forum can help with this question, it would be great to get some more insight. I think using ADAM would be feasible, but it is unclear to me whether it would be easy to run Spark on top of data stored in Arvados. I am still learning and reading. If I find an answer, I will post it here.
Before you ask, I do not contribute to Arvados ;)
I am trying ADAM / Spark / Big Data Genomics for storage, retrieval and analysis of 1000 Genomes VCF data.
They are hoping to have a production release at the end of this year.
Arvados looks like it is building everything from scratch,
while ADAM builds on general-purpose big data infrastructure like Spark / HDFS / Parquet / YARN. My bet would be on ADAM, also because Berkeley AMPLab, Broad and Mount Sinai are involved in the development.