Hello.
I am building a pipeline that starts by sub-sampling PE FASTQ files with seqtk. Unfortunately, seqtk does not accept PE files directly, so the two mate files have to be fed in one at a time with the same seed number. I want to repeat this process several times with different seed numbers and different numbers of reads kept. Downstream, I am going to assemble these sub-sampled reads.
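For reference, the manual commands I am trying to replicate for each pair look roughly like this (the file names and the read count of 10000 are just illustrative):

```shell
# Sub-sampling both mates with the same seed (-s) keeps the pairs in sync
seqtk sample -s 42 SRR2736093_1.fastq.gz 10000 > SRR2736093_1.sub.fq
seqtk sample -s 42 SRR2736093_2.fastq.gz 10000 > SRR2736093_2.sub.fq
```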
I have been inspired by https://github.com/h3abionet. Using their workflow as a template, I have managed to get pretty close to what I want.
I have created a new record schema to hold my data:
class: SchemaDefRequirement
types:
  - name: FilePair
    type: record
    fields:
      - name: forward
        type: File
      - name: reverse
        type: File
      - name: seed
        type: int[]
      - name: number
        type: int[]
      - name: rep
        type: int[]
      - name: id
        type: string
And here is an input file I have created:
fqSeqs:
  - forward:
      class: File
      path: /path/to/SRR2736093_1.fastq.gz
    reverse:
      class: File
      path: /path/to/SRR2736093_2.fastq.gz
    id: SRR2736093
    seed: [42, 10]
    number: [10, 10]
    rep: [1, 2]
  - forward:
      class: File
      path: /path/to/SRR2736094_1.fastq.gz
    reverse:
      class: File
      path: /path/to/SRR2736094_2.fastq.gz
    id: SRR2736094
    seed: [69, 12]
    number: [10, 10]
    rep: [1, 2]
I then have a master workflow:
cwlVersion: v1.0
class: Workflow
requirements:
  - class: ScatterFeatureRequirement
  - class: InlineJavascriptRequirement
  - class: StepInputExpressionRequirement
  - class: SubworkflowFeatureRequirement
  - $import: readPair.yml
inputs:
  fqSeqs:
    type:
      type: array
      items: "readPair.yml#FilePair"
outputs:
  fqout:
    type: "readPair.yml#FilePair[]"
    outputSource: subsample/resampled_fastq
steps:
  subsample:
    in:
      onePair: fqSeqs
    scatter: onePair
    out: [resampled_fastq]
    run: seqtk_sample_PE.cwl
The sub-workflow seqtk_sample_PE.cwl makes sure seqtk is run appropriately across each pair of FASTQ files:
cwlVersion: v1.0
class: Workflow
requirements:
  - class: ScatterFeatureRequirement
  - class: InlineJavascriptRequirement
  - class: StepInputExpressionRequirement
  - $import: readPair.yml
inputs:
  onePair: "readPair.yml#FilePair"
outputs:
  resampled_fastq:
    type: "readPair.yml#FilePair"
    outputSource: collect_output/fastq_pair_out
steps:
  subsample_1:
    in:
      fastq:
        source: onePair
        valueFrom: $(self.forward)
      seed:
        source: onePair
        valueFrom: $(self.seed)
      number:
        source: onePair
        valueFrom: $(self.number)
      rep:
        source: onePair
        valueFrom: $(self.rep)
    scatter: seed
    scatterMethod: dotproduct
    out: [seqtkout]
    run: seqtk_sample.cwl
  subsample_2:
    in:
      fastq:
        source: onePair
        valueFrom: $(self.reverse)
      seed:
        source: onePair
        valueFrom: ${
          console.log(self.seed);
          return self.seed; }
      number:
        source: onePair
        valueFrom: $(self.number)
      rep:
        source: onePair
        valueFrom: $(self.rep)
    scatter: seed
    scatterMethod: dotproduct
    out: [seqtkout]
    run: seqtk_sample.cwl
  collect_output:
    run:
      class: ExpressionTool
      inputs:
        seq_1: File
        seq_2: File
        id: string
      outputs:
        fastq_pair_out: "readPair.yml#FilePair"
      expression: >
        ${
          var ret = {};
          ret['forward'] = inputs.seq_1;
          ret['reverse'] = inputs.seq_2;
          ret['id'] = inputs.id;
          return { 'fastq_pair_out': ret };
        }
    in:
      seq_1: subsample_1/seqtkout
      seq_2: subsample_2/seqtkout
      id:
        source: onePair
        valueFrom: $(self.id)
    out: [fastq_pair_out]
And, finally, seqtk_sample.cwl actually does the work:
cwlVersion: v1.0
class: CommandLineTool
baseCommand: ['seqtk', 'sample']
stdout: $(inputs.fastq.nameroot)_$(inputs.number)_$(inputs.seed)_$(inputs.rep).fq
inputs:
  seed:
    type: int
    inputBinding:
      prefix: -s
      position: 1
  fastq:
    type: File
    inputBinding:
      position: 2
  number:
    type: int
    inputBinding:
      position: 3
  rep:
    # used only in the stdout file name, so no inputBinding
    type: int
outputs:
  seqtkout:
    type: stdout
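For debugging, I can run this tool on its own with a small job file along these lines (paths and values illustrative):

```yaml
seed: 42
number: 10000
rep: 1
fastq:
  class: File
  path: /path/to/SRR2736093_1.fastq.gz
```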
However, when I try to run the master workflow, I get the following error:
[workflow subsample] initialized from file:///Users/andersg/Documents/dev/mdu-qc-cwl/workflows/seqtk_sample_PE.cwl
[workflow subsample] workflow starting
[workflow subsample] starting step subsample_2
Unhandled exception
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/cwltool/workflow.py", line 311, in try_make_job
Callable[[Any], Any], callback), **kwargs)
File "/usr/local/lib/python2.7/site-packages/cwltool/workflow.py", line 672, in dotproduct_scatter
jo[s] = joborder[s][n]
File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ruamel/yaml/comments.py", line 502, in __getitem__
return ordereddict.__getitem__(self, key)
KeyError: 0
[workflow subsample] outdir is /var/folders/fj/s582ngbs28d78t98hf4gv0qjt74n0_/T/tmpchJbVd
Workflow cannot make any more progress.
Removing intermediate output directory /var/folders/fj/s582ngbs28d78t98hf4gv0qjt74n0_/T/tmpchJbVd
Removing intermediate output directory /var/folders/fj/s582ngbs28d78t98hf4gv0qjt74n0_/T/tmpRzqZLf
Final process status is permanentFail
It seems that I am not specifying my arrays correctly?
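One workaround I have been wondering about, in case scattering over a field pulled out of a record is the problem, is to give the sub-workflow plain array inputs so that scatter sees real arrays. An untested sketch of what I mean:

```yaml
# Untested idea: plain array inputs instead of fields of a FilePair record
inputs:
  forward: File
  reverse: File
  seeds: int[]
  numbers: int[]
  reps: int[]
steps:
  subsample_1:
    in:
      fastq: forward
      seed: seeds
      number: numbers
      rep: reps
    scatter: [seed, number, rep]
    scatterMethod: dotproduct
    out: [seqtkout]
    run: seqtk_sample.cwl
```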
Any help would be greatly appreciated. Thank you.
Anders.
If you have these files on GitHub or another easy-to-download location, then that'd make debugging easier :-)