CWL: file format metadata validation
7.1 years ago

Hi, I'm taking my first baby steps with CWL, and I was wondering whether it is possible to annotate the formats of the files in a tool/workflow specification in order to catch subtle errors.

An example:

inputs:
  normal_bam:
    type: File
    format:
      type: bam
      assert:
        source: normal
        patient_id: $(inputs.patient_id)
    secondaryFiles: .bai
    inputBinding:
      prefix: -I:normal
  tumor_bam:
    type: File
    format:
      type: bam
      assert:
        source: tumor
        patient_id: $(inputs.patient_id)
    secondaryFiles: .bai
    inputBinding:
      prefix: -I:tumor
  reference:
    type: File
    format:
      type: fasta
      assert:
        content: genome
        organism: homo_sapiens
    secondaryFiles: [.fai, ^.dict]
    inputBinding:
      prefix: --reference_sequence
  patient_id:
    type: string

In this case CWL could easily catch errors such as passing a BAM file containing a tumor sample where one containing a normal sample is expected.

I think this could be implemented partly by extending an ontology, but I can see that becoming tedious if, for example, you have to generate all possible combinations of content and organism for the fasta field (though I may be wrong, as I'm no expert in ontologies). Moreover, this could be a dynamic feature, where attributes are passed as inputs and/or attached to outputs and then passed down to the inputs of other workflows/tools as a pipeline runs.

Thanks!


Edit: revised the title to better explain the concept/idea

7.1 years ago

Hello Tarcisio Fedrizzi,

I think this is a great idea, thank you for sharing it. 👍

You can specify file formats already in CWL v1.0: http://www.commonwl.org/v1.0/CommandLineTool.html#CommandInputParameter

inputs:
  normal_bam:
    type: File
    format: http://edamontology.org/format_2572  # BAM

Or, using $namespaces to enable the edam prefix for brevity and $schemas to pin exactly which schema version is in use:

inputs:
  normal_bam:
    type: File
    format: edam:format_2572  # BAM

$namespaces: { edam: "http://edamontology.org/" }
$schemas: [ "http://edamontology.org/EDAM_1.16.owl" ]
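For the declared format to catch anything, the file handed to the tool needs a format of its own. A minimal sketch of an input object (job file), assuming a hypothetical file called normal.bam; how strictly a mismatch is enforced can vary between runners:

```yaml
# job.yml (hypothetical name): the input object passed alongside the tool description
normal_bam:
  class: File
  path: normal.bam                             # hypothetical path
  format: http://edamontology.org/format_2572  # BAM, same IRI the tool declares
```

The full IRI is used here because the edam: prefix is only available where $namespaces declares it.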

Also, with cwlVersion: v1.0 you can add a step to your workflow that verifies aspects of your data and metadata and fails if they don't meet your requirements.
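A rough sketch of such a verification step, as a standalone CommandLineTool. It assumes that the sample label ("normal" or "tumor") is recorded in the SM tag of the BAM read groups and that samtools is on the PATH; the input name expected_sample is made up for illustration, and your metadata may well live somewhere else:

```yaml
cwlVersion: v1.0
class: CommandLineTool
doc: Fails unless the BAM header has a read group with the expected SM (sample) tag.

requirements:
  ShellCommandRequirement: {}   # needed for the pipe below

inputs:
  bam:
    type: File
    format: edam:format_2572    # BAM
  expected_sample:              # hypothetical input, e.g. "normal" or "tumor"
    type: string

# No baseCommand: the whole command line is one unquoted shell fragment.
# The pipeline's exit status is grep's, so a missing tag makes the step fail.
arguments:
  - shellQuote: false
    valueFrom: >-
      samtools view -H $(inputs.bam.path) |
      grep -qw "SM:$(inputs.expected_sample)"

outputs: []

$namespaces: { edam: "http://edamontology.org/" }
```

Placed as the first step of a workflow, a non-zero exit from grep stops the run before the real analysis starts.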

The big question is, how will this metadata be represented? That is going to be fairly sub-discipline specific, so with CWL we aren't trying to solve that for every community. Instead we've made it possible for each community to plug in ontologies that they've developed and reason over them within the common framework.
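As a sketch of what plugging in a community ontology could look like: the onco namespace, the ontology URL and the tumor_bam term below are all hypothetical placeholders, not an existing vocabulary. The CWL spec decides format compatibility by exact match, owl:equivalentClass or rdfs:subClassOf, so such an ontology could declare the term as a subclass of the EDAM BAM format.

```yaml
inputs:
  tumor_bam:
    type: File
    format: onco:tumor_bam    # hypothetical term, not part of EDAM

$namespaces: { onco: "https://example.org/onco-formats#" }   # hypothetical namespace
$schemas: [ "https://example.org/onco-formats.owl" ]         # hypothetical ontology file
```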
