Question: CWL passing secondary files in a workflow
ttom wrote (4 months ago):

I have a CWL tool description that takes a bam file as one of its inputs, with the matching bai index declared as a secondary file. Run on its own as a CommandLineTool with cwl-runner, it works and produces results.

cat degradation.cwl

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [python, degradation.py]

inputs:
 annotation:
  type: File
  inputBinding:
   position: 2
   prefix: -a
 bam:
  type: File
  inputBinding:
   position: 1
   prefix: -n
  secondaryFiles: .bai

Now I would like to use this tool as one of the steps in a workflow. How can this be done?

cat workflow.cwl

cwlVersion: v1.0
class: Workflow
inputs:
------
------
outputs:
 alignment_out:
  type: File
  outputSource: star/star_bam
steps:
 star:
  run: star.cwl
  in: ---
   -------
  out: [star_bam]
 bam_indexing:
  run: index_bam.cwl
  in:
   bam: star/star_bam
  out: [bai]
 rna_degradation:
  run: degradation.cwl
  in:
   annotation: annotation
   bam: star/star_bam
   bai: bam_indexing/bai

Error

   expects secondaryFiles: .bai but
       source 'star_bam' does not include secondaryFiles.
       To fix, add secondaryFiles: .bai to definition of 'star_bam'.

star_bam is an output of the step star, and bai is an output of a separate step, bam_indexing. In that case, how can bai be passed as a secondary file of the bam input to the step rna_degradation?

biokcb wrote (4 months ago):

Part of the problem is that your workflow passes an input named bai, which the degradation.cwl shown above does not declare. You could probably do one of two things:

1) Modify degradation.cwl to take the .bai file explicitly: add bai: File to its inputs, and then your workflow.cwl can be used as is. This may require InitialWorkDirRequirement, but I'm not sure, so I'd test without it first.

degradation.cwl

inputs:
  bai: File
  annotation:
    type: File
    inputBinding:
     position: 2
     prefix: -a
  bam:
    type: File
    inputBinding:
     position: 1
     prefix: -n
    secondaryFiles: .bai

OR

2) Modify your workflow so that rna_degradation receives a File object that carries the .bai as a secondary file. One way is to add a step between bam_indexing and rna_degradation that is an ExpressionTool returning the bam File object with the bai attached as a secondary file. A potentially easier way, depending on how index_bam.cwl is written, is to have index_bam.cwl take the .bam file as input and return the bam with the bai as its secondary file, instead of returning the bai alone, like so:

index_bam.cwl

outputs:
  indexed_bam:
    type: File
    secondaryFiles: .bai
    outputBinding:
      glob: $(inputs.bam.basename)

If index_bam.cwl doesn't work like this, I would add the ExpressionTool step instead, but try one of these suggestions first and see how it goes. My preference is option (2). Let me know what ends up working. If neither works, or something doesn't make sense, please post your full CWL documents so I can better understand why.
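
The ExpressionTool route mentioned above could look roughly like the sketch below. It is not taken from the thread: the file name attach_bai.cwl and the output name bam_with_index are made up for illustration, and it assumes the bai is named after the bam (for example SampleA.bam.bai) so that the .bai pattern matches.

```yaml
# attach_bai.cwl (hypothetical helper; names are illustrative only)
cwlVersion: v1.0
class: ExpressionTool

requirements:
  InlineJavascriptRequirement: {}

inputs:
  bam: File   # e.g. star/star_bam
  bai: File   # e.g. bam_indexing/bai

outputs:
  bam_with_index:
    type: File
    # Declared so the workflow checker sees that this source carries a .bai
    secondaryFiles: .bai

expression: |
  ${
    // Return the bam File object with the bai attached as its secondary file
    var out = inputs.bam;
    out.secondaryFiles = [inputs.bai];
    return {"bam_with_index": out};
  }
```

In the workflow, this step would take bam: star/star_bam and bai: bam_indexing/bai as its in entries, and rna_degradation would then use its bam_with_index output as bam.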

(answer written 4 months ago by biokcb)

Option 1: The script wrapped by degradation.cwl has no option for passing the bai file explicitly. Instead, it looks for the bai file in the same directory as the bam file, which is why declaring the bai as a secondary file worked.

Option 2: My index_bam.cwl returns only a bai file, not a bam file, so I guessed that declaring the bam as the output would not work. I tried it anyway; the results are pasted below.

index_bam.cwl (Working code)

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools]
doc: "samtools: index"

inputs:
 bam:
  type: File
  inputBinding:
   position: 1
   prefix: index
outputs:
 bai:
  type: stdout

stdout: $(inputs.bam.basename).bai

index_bam.cwl (NOT Working code)

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools]
doc: "samtools: index"

inputs:
 bam:
  type: File
  inputBinding:
   position: 1
   prefix: index
outputs:
 indexed_bam:
  type: File
  secondaryFiles: .bai
  outputBinding:
   glob: $(inputs.bam.basename)

ERROR

cwl-runner index_bam.cwl index_bam.yml 
conda/bin/cwl-runner 1.0.20180521150620
Resolved 'index_bam.cwl' to 'index_bam.cwl'
[job index_bam.cwl] /tmp/tmp7UywDj$ samtools \
    index \
    /tmp/tmpZe6GNu/stga7547f8b-8d5f-48d5-8d84-4348bb8e95ae/SampleA.bam
[job index_bam.cwl] Job error:
Error collecting output for parameter 'indexed_bam':
index_bam.cwl:27:4: Did not find output file with glob pattern: '['SampleA.bam']'
[job index_bam.cwl] completed permanentFail
{}
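
For anyone reproducing this run, the job file passed as index_bam.yml might look like the minimal sketch below. The path is an assumption based on the file name in the log above; the actual job file was not posted.

```yaml
# index_bam.yml (hypothetical contents; the real job file is not shown in this thread)
bam:
  class: File
  path: SampleA.bam   # assumed location, relative to the job file
```

In the log, the bam stays in the staging directory (/tmp/tmpZe6GNu/...) and is never placed in the job's output directory (/tmp/tmp7UywDj), so a glob for its basename there finds nothing.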
(reply written 4 months ago by ttom)

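Try adding an InitialWorkDirRequirement, so that the bam is staged into the working directory and samtools writes the bai next to it:

index_bam.cwl (with InitialWorkDirRequirement)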

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools]
doc: "samtools: index"

requirements:
  InitialWorkDirRequirement:
    listing: [ $(inputs.bam) ]

inputs:
 bam:
  type: File
  inputBinding:
   position: 1
   prefix: index
outputs:
 indexed_bam:
  type: File
  secondaryFiles: .bai
  outputBinding:
   glob: $(inputs.bam.basename)

As another note, if you are using common bioinformatics tools, it may be worth trying the prewritten CWL tool descriptions here: https://github.com/common-workflow-language/workflows/tree/master/tools. They should get things running without too much effort on your end. There is even a samtools-index.cwl available.
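
If you keep your own tool, a slightly tidier variant is sketched below. It is not the contents of the linked samtools-index.cwl; it only moves the index subcommand into baseCommand instead of passing it as a prefix on the bam input, and otherwise keeps the same staging and output pattern as above.

```yaml
cwlVersion: v1.0
class: CommandLineTool
doc: "samtools index, returning the bam with its bai as a secondary file"
baseCommand: [samtools, index]

requirements:
  InitialWorkDirRequirement:
    # Stage the bam into the working directory so the bai is written beside it
    listing: [ $(inputs.bam) ]

inputs:
  bam:
    type: File
    inputBinding:
      position: 1

outputs:
  indexed_bam:
    type: File
    secondaryFiles: .bai
    outputBinding:
      # Picks up the staged bam; the .bai pattern then resolves to <basename>.bai
      glob: $(inputs.bam.basename)
```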

(reply written 4 months ago by biokcb)

Yes, it makes sense to use the prewritten tools. Thank you!

After adding InitialWorkDirRequirement, index_bam.cwl captures both the bam and the bai as outputs:

cat index_bam.cwl

requirements:
 InitialWorkDirRequirement:
  listing: [ $(inputs.bam) ]


inputs:
 bam:
  type: File
  inputBinding:
   position: 1
   prefix: index

outputs:
 indexed_bam:
  type: File
  secondaryFiles: .bai
  outputBinding:
   glob: $(inputs.bam.basename)

But I still have problems in the workflow when passing the output of the step bam_indexing, with its secondary file, to the step rna_degradation.

The step rna_degradation is still asking for a bai input:

cat workflow.cwl

cwlVersion: v1.0
class: Workflow
inputs:
------
------
outputs:
 alignment_out:
  type: File
  outputSource: star/star_bam
steps:
 star:
  run: star.cwl
  in: ---
   -------
  out: [star_bam]
 bam_indexing:
  run: index_bam.cwl
  in:
   bam: star/star_bam
  out: [indexed_bam]
 rna_degradation:
  run: degradation.cwl
  in:  
   annotation: annotation
   bam: bam_indexing/indexed_bam

workflow.cwl:149:3: Step is missing required parameter 'bai'

Line 149 is the in: line of the rna_degradation step (run: degradation.cwl) in workflow.cwl.

(reply written 3 months ago by ttom)

In your degradation.cwl, remove the bai entry under inputs. The workflow step no longer supplies bai, and the index now arrives as a secondary file of bam.

(reply written 3 months ago by biokcb)

Ah, my bad. I thought I had removed it, but I hadn't.

With that line removed, it works. Thank you.

(reply written 3 months ago by ttom)
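
To summarize the wiring that ended up working, here is a minimal sketch. It is not the poster's actual workflow.cwl: the star step and its elided inputs are left out, the input name aligned_bam is a hypothetical stand-in for star/star_bam, and out: [] assumes degradation.cwl declares no outputs that the workflow needs.

```yaml
cwlVersion: v1.0
class: Workflow

inputs:
  aligned_bam: File   # hypothetical; stands in for star/star_bam
  annotation: File

outputs:
  indexed_bam:
    type: File
    outputSource: bam_indexing/indexed_bam

steps:
  bam_indexing:
    run: index_bam.cwl          # returns the bam with .bai as a secondary file
    in:
      bam: aligned_bam
    out: [indexed_bam]
  rna_degradation:
    run: degradation.cwl        # declares bam with secondaryFiles: .bai, no bai input
    in:
      annotation: annotation
      bam: bam_indexing/indexed_bam
    out: []                     # assumption: no outputs collected from this step
```

The key point is that rna_degradation takes its bam from bam_indexing/indexed_bam rather than from the aligner output, so the File it receives already carries the bai.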