Question: CWL: file format metadata validation
2.4 years ago by
Tarcisio Fedrizzi wrote:

Hi, I'm taking my first steps with CWL, and I was wondering whether it is possible to annotate the formats of the files declared in a tool/workflow description in order to catch subtle errors.

An example:

inputs:
  normal_bam:
    type: File
    format:
      type: bam
      assert:
        source: normal
        patient_id: $(inputs.patient_id)
    secondaryFiles: .bai
    inputBinding:
      prefix: -I:normal
  tumor_bam:
    type: File
    format:
      type: bam
      assert:
        source: tumor
        patient_id: $(inputs.patient_id)
    secondaryFiles: .bai
    inputBinding:
      prefix: -I:tumor
  reference:
    type: File
    format:
      type: fasta
      assert:
        content: genome
        organism: homo_sapiens
    secondaryFiles: [.fai, ^.dict]
    inputBinding:
      prefix: --reference_sequence
  patient_id:
    type: string

In this case, CWL could easily catch errors such as passing a BAM file containing a tumor sample instead of one containing a normal sample.

I think this could be implemented partly by extending an ontology, but I can see that becoming tedious if, for example, you have to generate every possible combination of content and organism for the fasta field (but maybe I'm wrong, as I'm no expert in ontologies). Moreover, this could be a dynamic feature, where attributes are passed as inputs and/or attached to outputs and then passed down to the inputs of other workflows/tools as a pipeline runs.

Thanks!


Edit: revised the title to better explain the concept/idea

2.4 years ago by
Common Workflow Language project
Michael R. Crusoe wrote:

Hello Tarcisio Fedrizzi,

I think this is a great idea, thank you for sharing it. 👍

You can specify file formats already in CWL v1.0: http://www.commonwl.org/v1.0/CommandLineTool.html#CommandInputParameter

inputs:
  normal_bam:
    type: File
    format: http://edamontology.org/format_2572  # BAM

Or use $namespaces to enable the edam prefix for brevity, and $schemas to be very specific about which schema version is in use:

inputs:
  normal_bam:
    type: File
    format: edam:format_2572  # BAM

$namespaces: { edam: "http://edamontology.org/" }
$schemas: [ "http://edamontology.org/EDAM_1.16.owl" ]
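For a runner to compare formats, the input object (job file) also has to state the format of each file it passes in. A minimal sketch, assuming a hypothetical file named normal.bam and a runner that implements format checking; the full IRI is spelled out so the job file does not depend on a prefix declaration:

```yaml
# Hypothetical job file for the tool above; "normal.bam" is a placeholder name.
normal_bam:
  class: File
  path: normal.bam
  format: http://edamontology.org/format_2572  # BAM, same concept the tool declares
```

If the format given here is not the one the tool declares (or a subclass of it in the loaded ontology), a format-checking runner can reject the input before the tool runs.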

Also, with cwlVersion: v1.0 you can add a step to your workflow that verifies aspects of your data and metadata and fails if they don't meet your requirements.
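A rough sketch of such a verification step, not a definitive implementation. It assumes that samtools and grep are on the PATH and that the normal/tumor label is stored in the SM tag of the BAM header's read group lines; the input names bam and expected_sample are made up for this example:

```yaml
cwlVersion: v1.0
class: CommandLineTool
doc: Fails unless a BAM header line carries the expected SM (sample) tag.

requirements:
  ShellCommandRequirement: {}  # needed for the unquoted pipe below

inputs:
  bam:
    type: File
    format: edam:format_2572  # BAM
  expected_sample:            # hypothetical input, e.g. "normal" or "tumor"
    type: string

# Runs: samtools view -H <bam> | grep -q -w SM:<expected_sample>
# The exit status is grep's: 0 if a header line matches, non-zero otherwise,
# and a non-zero exit makes the step fail.
arguments:
  - samtools
  - view
  - -H
  - $(inputs.bam.path)
  - valueFrom: "|"
    shellQuote: false
  - grep
  - -q
  - -w
  - SM:$(inputs.expected_sample)

outputs: []

$namespaces: { edam: "http://edamontology.org/" }
```

The match is deliberately loose (any header line containing the tag counts), so a real check would probably want a stricter pattern or a small script.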

The big question is, how will this metadata be represented? That is going to be fairly sub-discipline specific, so with CWL we aren't trying to solve that for every community. Instead we've made it possible for each community to plug in ontologies that they've developed and reason over them within the common framework.
