Question: Parse CWL to graph data structure or database?
mcpherson (Los Alamos National Laboratory) wrote, 13 months ago:

We're considering building a new "runner". We envision using a graph database (e.g. Neo4j) to describe and control the running workflow. Step one is to convert a CWL file to a graph. Is there any extant software/library to do this? We've looked at extracting parse code from cwltool or Toil but that looks like an overly complicated hack. Python YAML packages (e.g. pyyaml) exist, or cwl-utils, but those don't appear to parse to a graph (nodes, dependencies, etc.). Maybe, given a dictionary from pyyaml, the task/dependency structure is easily apparent and we just need to look a little deeper?
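For readers wondering the same thing: the task/dependency structure is recoverable from a plain dict, with a little walking. A minimal sketch (not from the thread, using no CWL tooling) that derives step-to-step edges from the `in`/`source` fields of a workflow dict as `yaml.safe_load` would return it; the toy workflow and its step/file names are invented:

```python
# Minimal sketch: derive step dependencies from a CWL workflow parsed
# into a plain dict (e.g. by yaml.safe_load). Toy data for illustration.
# A step's "in" entries reference sources like "stepname/outputid";
# a reference without a slash comes from a workflow-level input.
workflow = {
    "class": "Workflow",
    "inputs": {"raw": "File"},
    "steps": {
        "trim": {
            "run": "trim.cwl",
            "in": {"reads": "raw"},
            "out": ["trimmed"],
        },
        "align": {
            "run": "align.cwl",
            "in": {"reads": "trim/trimmed"},
            "out": ["bam"],
        },
    },
    "outputs": {"result": {"outputSource": "align/bam"}},
}

def step_edges(wf):
    """Return (upstream, downstream) step pairs implied by in/source refs."""
    edges = []
    for name, step in wf["steps"].items():
        for source in step.get("in", {}).values():
            ref = source if isinstance(source, str) else source["source"]
            if "/" in ref:  # "step/output" means a step-to-step dependency
                edges.append((ref.split("/")[0], name))
    return edges

print(step_edges(workflow))  # [('trim', 'align')]
```

Real CWL allows more shapes than this sketch handles (list-form `steps`, list-valued `source`, `id`-qualified names), which is part of why the tools discussed below exist.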

Tags: cwl, neo4j
modified 3 months ago by ngsbioinformatics

Michael,

Thanks for your prompt response.

Honestly, I was hoping my first post would result in a "Hey, we already did that..." reply from the community ;-)

We were able to run the cwltool --print... versions. They seem to work fine.

I probably didn't look hard enough at cwl-utils. I need to work on understanding the relationship between Schema Salad and that tool. At first glance it looked like a _verifier_ of a CWL file. I will look at the link you provided.

Thanks for the hint on cwltool --pack.

At this point our test case CWL files (test run with Toil, cwl-runner) are very simplistic. Is there a repository of canonical test CWL files that are relatively large and complicated (and cover the spectrum of CWL capabilities)? The Bio folks seem to have some monstrous ones, but "blessed" examples would be preferable.

Thanks again for your help!

written 13 months ago by mcpherson

cwl-utils is a collection of scripts to demonstrate the use of the new Python classes for loading and parsing CWL v1.0 documents. schema-salad-tool --codegen python https://github.com/common-workflow-language/common-workflow-language/raw/master/v1.0/CommonWorkflowLanguage.yml was used to create https://github.com/common-workflow-language/cwl-utils/blob/master/cwl_utils/parser_v1_0.py

(Thanks for asking, I just updated the README with the above explanation)

You may find the CWL conformance tests to be useful, though I don't recommend reading them for style hints :-)

  • https://github.com/common-workflow-language/common-workflow-language/blob/master/CONFORMANCE_TESTS.md
  • https://github.com/common-workflow-language/common-workflow-language/blob/master/v1.0/conformance_test_v1.0.yaml
  • https://github.com/common-workflow-language/common-workflow-language/tree/master/v1.0/v1.0

FYI: cwl-runner is the generic name for any CWL runner. The CWL reference runner is cwltool.

written 13 months ago by Michael R. Crusoe

I too am looking for something like what you described. I want to create a workflow from a Python script (or any language, for that matter), convert it to a digraph, and store it. The digraph could then be converted to CWL or WDL or anything else. We would also need to import the digraph from CWL to convert to another workflow language. I haven't seen anything like this yet. If you have started implementing something, please post what it is and where we can find it. Thanks.

written 3 months ago by ngsbioinformatics

Official CWL support has moved from Biostars to its own forum. Please post this there.

modified 3 months ago • written 3 months ago by genomax
Michael R. Crusoe (Common Workflow Language project) wrote, 13 months ago:

Hello mcpherson,

That is exciting to hear about your plans!

Did you explore using the RDF graph output from the CWL reference runner via cwltool --print-rdf?

I'm also surprised that cwl-utils didn't seem useful to you all. It does take another recursive traversal to load everything, but once done you get real Python objects for the entire CWL description: see https://github.com/common-workflow-language/cwl-utils/blob/master/cwl_utils/docker_extract.py for an example.

If you do want to use an off the shelf YAML/JSON library I recommend passing the CWL descriptions through cwltool --pack to get a simpler, normalized form.
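To make the "simpler, normalized form" concrete: `cwltool --pack` emits a JSON document, so the standard-library `json` module is enough to walk it. A minimal sketch, using a hand-written, heavily abbreviated stand-in for packed output (real packed files carry many more fields); a top-level `$graph` list appears when the document contains several process objects:

```python
import json

# Sketch: reading a cwltool --pack style document with stdlib json.
# The string below is an invented, trimmed stand-in, not real cwltool
# output; packed ids are fragment references like "#main/trim".
packed = json.loads("""
{
  "cwlVersion": "v1.0",
  "$graph": [
    {"class": "CommandLineTool", "id": "#trim.cwl"},
    {"class": "Workflow", "id": "#main",
     "steps": [{"id": "#main/trim", "run": "#trim.cwl"}]}
  ]
}
""")

# The top-level workflow is conventionally the process with id "#main".
main = next(p for p in packed["$graph"] if p["id"] == "#main")
step_ids = [s["id"] for s in main["steps"]]
print(step_ids)  # ['#main/trim']
```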

I hope this was helpful,

written 13 months ago by Michael R. Crusoe
mcpherson (Los Alamos National Laboratory) wrote, 12 months ago:

For those arriving via search, I'm going to answer my own question as a summary of what I've found searching and experimenting. Many thanks to Michael for his responses and help.

Again, the premise. We (LANL BEE) view a workflow most naturally as a graph. We also favor the "property store" model over the "triple store" model (hence the choice of Neo4j as our baseline store). See this Stack Overflow answer for a quick summary of differences between these two approaches. In this model the steps (or tasks) are nodes and the dependencies between these tasks are the edges. Properties on these nodes and edges represent things like files consumed/produced, executable names, software/system requirements, etc. So, our goal (and the subject of this question) is how to parse a CWL file into this kind of graph. Here's a summary of what I've found:

  • cwltool: The reference CWL implementation (mentioned above by Michael). This will read, process (verify), and execute a CWL workflow. It also includes useful features for transforming CWL to other "formats" (see below). There is useful documentation at the end of the README.rst file on its operation. I suppose one could "cannibalize" this code to produce a graph of the input workflow, but it probably isn't the optimal way to proceed. As far as I can tell, it is written from scratch and doesn't use any external CWL "tooling" (e.g. parser_v1_0.py discussed below).
  • cwl-utils, and in particular, parser_v1_0.py from that repo: This repo contains tests and examples of the auto-generated CWL parser. The parser, in Python, is generated directly from the Schema Salad description of CWL. The parser builds Python objects (e.g. WorkflowStep, ExpressionTool, etc.) rather than a Python dictionary. Other than an example provided by Michael (and the small ones in the repo) I was unable to find examples of it used in the wild (see my answer to my own question on that subject). However, at this point, it looks like the most likely base for our future work.
  • cwltool --print-rdf: This will dump a CWL file to a graph in the RDF format. This file can then be processed using RDF tools (e.g. RDFLib, SPARQL). As I mentioned earlier, we don't view an RDF triple graph as ideal for our purposes. YMMV.
  • YAML: As mentioned in this question, CWL files can be processed using a YAML library like PyYAML which will return a Python dictionary representing the CWL workflow.
  • cwltool --pack: You may also find this tool useful to collect a CWL workflow with external file references into a single CWL file.

One thing all the approaches listed above have in common is that they only handle the "syntactic" content of the CWL workflow (very broadly speaking). The "semantic" information inherent in the CWL workflow must be constructed by a runner (a runner is code that executes a CWL workflow). For example, the CWL file specifies inputs and outputs of the workflow and the individual steps (the "syntax"). The dependencies between steps are defined by these inputs and outputs. They need to be connected (the "semantics") to construct a real representation of a workflow (or determined on-the-fly as in cwltool). At least in our case, that means building a property graph. At the outset, I had hoped that someone had already done this. Alas, that doesn't appear to be the case, but the above tools should help us do this ourselves.
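As one concrete reading of that connecting step (an illustration, not from the thread): given parsed step records, a property-graph shape can be built by matching each step's `in` sources against the other steps' outputs, producing nodes with properties and labeled edges of the kind one might load into Neo4j. All step and tool names below are invented:

```python
# Sketch: turn parsed step records into a property-graph shape
# (nodes with properties, explicit labeled edges). Toy data.
steps = {
    "fastqc": {"run": "fastqc.cwl", "in": {"seq": "reads"},
               "out": ["report"]},
    "multiqc": {"run": "multiqc.cwl", "in": {"reports": "fastqc/report"},
                "out": ["summary"]},
}

# Nodes: one per task, carrying properties a runner cares about.
nodes = {
    name: {"label": "Task", "tool": s["run"], "outputs": s["out"]}
    for name, s in steps.items()
}

# Edges: a "step/output" source means data flows from that step to this one.
edges = [
    (src.split("/")[0], name, {"via": src.split("/")[1]})
    for name, s in steps.items()
    for src in s["in"].values()
    if "/" in src
]

print(edges)  # [('fastqc', 'multiqc', {'via': 'report'})]
```

In Neo4j terms, each node/edge here would become a CREATE of a `Task` node or a relationship between two of them, with the dict entries as properties.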

If I've gotten any of this terribly wrong, I hope an expert on this forum will correct me.

modified 12 months ago • written 12 months ago by mcpherson

Great summary mcpherson!

Yes, we'd like to add more of the semantic processing to parser_v1_N.py either automatically or via an external file that gets included at code generation time.

Some aspects of the graph you'll have to leave slightly abstract until execution time: scattering can't be realized until you know the size (and perhaps the values) of the inputs, and the same will be true of the upcoming conditionals feature.

written 12 months ago by Michael R. Crusoe
Powered by Biostar version 2.3.0