Hello.
I am building a pipeline that starts by sub-sampling PE FASTQ files with seqtk. Unfortunately, seqtk does not accept PE files directly, so the two mate files have to be fed in one at a time with the same seed number. I want to repeat this process several times with different seed numbers and different numbers of reads kept. Downstream, I am going to assemble these sub-sampled reads.
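For reference, the manual commands I am trying to replicate for each pair look roughly like this (the file names and the read count of 10000 are just illustrative):

```shell
# Sub-sampling both mates with the same seed (-s) keeps the pairs in sync
seqtk sample -s 42 SRR2736093_1.fastq.gz 10000 > SRR2736093_1.sub.fq
seqtk sample -s 42 SRR2736093_2.fastq.gz 10000 > SRR2736093_2.sub.fq
```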
I have been inspired by https://github.com/h3abionet. Using their workflow as a template, I have managed to get pretty close to what I want.
I have created a new record schema to hold my data:
class: SchemaDefRequirement
types:
  - name: FilePair
    type: record
    fields:
      - name: forward
        type: File
      - name: reverse
        type: File
      - name: seed
        type: int[]
      - name: number
        type: int[]
      - name: rep
        type: int[]
      - name: id
        type: string
And here is an input file I have created:
fqSeqs:
  - forward:
      class: File
      path: /path/to/SRR2736093_1.fastq.gz
    reverse:
      class: File
      path: /path/to/SRR2736093_2.fastq.gz
    id: SRR2736093
    seed: [42, 10]
    number: [10, 10]
    rep: [1, 2]
  - forward:
      class: File
      path: /path/to/SRR2736094_1.fastq.gz
    reverse:
      class: File
      path: /path/to/SRR2736094_2.fastq.gz
    id: SRR2736094
    seed: [69, 12]
    number: [10, 10]
    rep: [1, 2]
I then have a master workflow:
cwlVersion: v1.0
class: Workflow
requirements:
  - class: ScatterFeatureRequirement
  - class: InlineJavascriptRequirement
  - class: StepInputExpressionRequirement
  - class: SubworkflowFeatureRequirement
  - $import: readPair.yml
inputs:
  fqSeqs:
    type:
      type: array
      items: "readPair.yml#FilePair"
outputs:
  fqout:
    type: "readPair.yml#FilePair[]"
    outputSource: subsample/resampled_fastq
steps:
  subsample:
    in:
      onePair: fqSeqs
    scatter: onePair
    out: [resampled_fastq]
    run: seqtk_sample_PE.cwl
The sub-workflow seqtk_sample_PE.cwl makes sure seqtk is run appropriately across each pair of FASTQ files:
cwlVersion: v1.0
class: Workflow
requirements:
  - class: ScatterFeatureRequirement
  - class: InlineJavascriptRequirement
  - class: StepInputExpressionRequirement
  - $import: readPair.yml
inputs:
  onePair: "readPair.yml#FilePair"
outputs:
  resampled_fastq:
    type: "readPair.yml#FilePair"
    outputSource: collect_output/fastq_pair_out
steps:
  subsample_1:
    in:
      fastq:
        source: onePair
        valueFrom: $(self.forward)
      seed:
        source: onePair
        valueFrom: $(self.seed)
      number:
        source: onePair
        valueFrom: $(self.number)
      rep:
        source: onePair
        valueFrom: $(self.rep)
    scatter: seed
    scatterMethod: dotproduct
    out: [seqtkout]
    run: seqtk_sample.cwl
  subsample_2:
    in:
      fastq:
        source: onePair
        valueFrom: $(self.reverse)
      seed:
        source: onePair
        valueFrom: ${
          console.log(self.seed);
          return self.seed; }
      number:
        source: onePair
        valueFrom: $(self.number)
      rep:
        source: onePair
        valueFrom: $(self.rep)
    scatter: seed
    scatterMethod: dotproduct
    out: [seqtkout]
    run: seqtk_sample.cwl
  collect_output:
    run:
      class: ExpressionTool
      inputs:
        seq_1: File
        seq_2: File
        id: string
      outputs:
        fastq_pair_out: "readPair.yml#FilePair"
      expression: >
        ${
          var ret = {};
          ret['forward'] = inputs.seq_1;
          ret['reverse'] = inputs.seq_2;
          ret['id'] = inputs.id;
          return { 'fastq_pair_out': ret };
        }
    in:
      seq_1: subsample_1/seqtkout
      seq_2: subsample_2/seqtkout
      id:
        source: onePair
        valueFrom: $(self.id)
    out: [fastq_pair_out]
And, finally, seqtk_sample.cwl actually does the work:
cwlVersion: v1.0
class: CommandLineTool
baseCommand: ['seqtk', 'sample']
stdout: $(inputs.fastq.nameroot)_$(inputs.number)_$(inputs.seed)_$(inputs.rep).fq
inputs:
  seed:
    type: int
    inputBinding:
      prefix: -s
      position: 1
  fastq:
    type: File
    inputBinding:
      position: 2
  number:
    type: int
    inputBinding:
      position: 3
  rep:
    # used only in the stdout file name, so no inputBinding
    type: int
outputs:
  seqtkout:
    type: stdout
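For debugging, I can run this tool on its own with a small job file along these lines (paths and values illustrative):

```yaml
seed: 42
number: 10000
rep: 1
fastq:
  class: File
  path: /path/to/SRR2736093_1.fastq.gz
```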
However, when I try to run the master workflow, I get the following error:
[workflow subsample] initialized from file:///Users/andersg/Documents/dev/mdu-qc-cwl/workflows/seqtk_sample_PE.cwl
[workflow subsample] workflow starting
[workflow subsample] starting step subsample_2
Unhandled exception
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/cwltool/workflow.py", line 311, in try_make_job
Callable[[Any], Any], callback), **kwargs)
File "/usr/local/lib/python2.7/site-packages/cwltool/workflow.py", line 672, in dotproduct_scatter
jo[s] = joborder[s][n]
File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ruamel/yaml/comments.py", line 502, in __getitem__
return ordereddict.__getitem__(self, key)
KeyError: 0
[workflow subsample] outdir is /var/folders/fj/s582ngbs28d78t98hf4gv0qjt74n0_/T/tmpchJbVd
Workflow cannot make any more progress.
Removing intermediate output directory /var/folders/fj/s582ngbs28d78t98hf4gv0qjt74n0_/T/tmpchJbVd
Removing intermediate output directory /var/folders/fj/s582ngbs28d78t98hf4gv0qjt74n0_/T/tmpRzqZLf
Final process status is permanentFail
It seems that I am not specifying my arrays correctly?
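One workaround I have been wondering about, in case scattering over a field pulled out of a record is the problem, is to give the sub-workflow plain array inputs so that scatter sees real arrays. An untested sketch of what I mean:

```yaml
# Untested idea: plain array inputs instead of fields of a FilePair record
inputs:
  forward: File
  reverse: File
  seeds: int[]
  numbers: int[]
  reps: int[]
steps:
  subsample_1:
    in:
      fastq: forward
      seed: seeds
      number: numbers
      rep: reps
    scatter: [seed, number, rep]
    scatterMethod: dotproduct
    out: [seqtkout]
    run: seqtk_sample.cwl
```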
Any help would be greatly appreciated. Thank you.
Anders.
If you have these files on GitHub or another easy-to-download location, then that'd make debugging easier :-)