Question: Combine subject with sample type in a scatter?
7 months ago by
alanh80
alanh80 wrote:

I have a CWL workflow that runs a tumor-normal sample pair for a given subject.

For later variant calling, I want the read names to be something like $(subjectName)_$(fastqs.sample).

The inputs are like this:

inputs:
  fastqs:
    type:
      type: array
      items:
        type: record
        fields:
          - name: sample
            type: string
          - name: files
            type:
              type: array
              items: File
  referenceFasta:
    type: File
  subjectName:
    type: string

steps:

  dna_align_and_sort:

    run: align_sort.cwl
    in:
      reference_fasta: referenceFasta
      fastq_files:
        source: fastqs
        valueFrom: $(self.files)
      sample_name:
        source: fastqs
        valueFrom: $(MAGIC LINE HERE)  # <---- WHAT GOES HERE?
    out:
      [fileInDir]
    scatter: [fastq_files, sample_name]
    scatterMethod: dotproduct

Can someone tell me what should go into the sample_name valueFrom to make this work?

I have successfully inserted the fastqs.sample as the name using $(self.sample) so I know the underlying code works.

cwl • 365 views
modified 7 months ago • written 7 months ago by alanh80

Can you elaborate on the kind of problem your example causes? Is it related to the scattering or the referencing of subjectName in the context of the step?

written 7 months ago by Tom520

In a later step, the Mutect2 somatic variant caller seems to name the FORMAT data column in its output VCF using the value in the reads. In my example above, "$(self.sample)" is either "tumor" or "normal" depending on the sample type, so the reads get named "tumor" or "normal" accordingly.

The problem occurs after that, when I try to build a panel of normals (PON) from the normal samples: if they're all named the same thing (Normal, Normal, Normal, etc.), the PON creation step fails because the names collide.

written 7 months ago by alanh80

Can adding the subjectName solve this? Only a single subject name is given to the workflow. Wouldn't they still all have the same (albeit longer) name?

Do the fastq files have unique names? If so, you could add their nameroot to the sample names to distinguish between them.

modified 7 months ago • written 7 months ago by Tom520

I've tried a bunch of iterations here:
valueFrom: "$(subjectName)_$(self.sample)"
valueFrom: ${return inputs.subjectName.concat("_",fastqs.sample)}

and they all fail with various issues.

written 7 months ago by alanh80

Try adding "subjectName" to the step's inputs (its in: section) and then referring to it this way:

valueFrom: "$(inputs.subjectName)_$(self.sample)"
written 7 months ago by peter.amstutz300
7 months ago by
Tom520
Bielefeld University, CeBiTec, Germany
Tom520 wrote:

Regarding the issues with valueFrom: ${return inputs.subjectName.concat("_",fastqs.sample)} and similar constructs:

In this context, inputs does not refer to the workflow's inputs but to the inputs of the step. Looking at the valueFrom section in this segment of the specification leads me to believe there is no way to reference the workflow's subjectName input directly in the StepInputExpression. You would have to pass subjectName to your align_sort.cwl tool and concatenate the names there.

Another option would be this horrid workaround I use:

Add an input parameter to your align_sort.cwl:

[...]
inputs:
  namesource:
    type: string?  # This might also be File etc.
[...]

You don't give it an input binding, so the tool itself will never use it. It's optional, so everything will run fine if you don't provide it to the tool. But you can pass subjectName to the workflow step as an input parameter and reference it in the JavaScript expression:

[...]
steps:
  dna_align_and_sort:
    run: align_sort.cwl
    in:
      namesource: subjectName
      sampleName: 
        valueFrom: $((inputs.namesource)+ (whatever.you.like))
[...]
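
The scoping rule behind this workaround can be illustrated with a small JavaScript model. This is only a sketch of the idea, not code from cwltool or any other runner: evaluateValueFrom and all sample values are invented for illustration, and a real runner may report an error or null where this toy model yields "undefined".

```javascript
// Toy model of a step input's valueFrom evaluation (not real runner code).
// `evaluateValueFrom` is a hypothetical helper invented for this sketch.
function evaluateValueFrom(expressionBody, stepInputs, sourceValue) {
  // The expression can see exactly two names:
  //   inputs -> the entries listed under the step's `in`
  //   self   -> the value delivered by this entry's `source`
  var fn = new Function("inputs", "self", "return (" + expressionBody + ");");
  return fn(stepInputs, sourceValue);
}

// One scattered element of `fastqs` (illustrative values).
var record = { sample: "tumor", files: [] };

// subjectName was never listed under `in`, so `inputs` has no such key.
var without = evaluateValueFrom(
  'inputs.subjectName + "_" + self.sample',
  { reference_fasta: "ref.fa" },
  record
);

// With `namesource: subjectName` under `in`, the value becomes visible.
var withSource = evaluateValueFrom(
  'inputs.namesource + "_" + self.sample',
  { reference_fasta: "ref.fa", namesource: "subject01" },
  record
);

console.log(without);    // undefined_tumor
console.log(withSource); // subject01_tumor
```

The point of the model is only that the expression never sees the workflow-level inputs object; anything it needs has to arrive through the step's own in: entries.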
modified 7 months ago • written 7 months ago by Tom520

How about this:

  sampleName: 
    valueFrom: |
      ${
        if (inputs.subjectName) {
          return inputs.subjectName + "_" + inputs.sampleName
        } else {
          return inputs.sampleName
        }
      }
modified 7 months ago • written 7 months ago by alanh80

As I mentioned: I don't think it is possible to reference inputs.subjectName unless subjectName is an input of the WorkflowStep. Also, why reference inputs.sampleName inside the expression that is supposed to yield inputs.sampleName? I feel like I'm fundamentally misunderstanding what is to be accomplished here.

written 7 months ago by Tom520

So, to provide a more precise solution: you can get a combination of the workflow input subjectName and the sample field of the fastqs array (as requested in the opening post) by doing the following. Add this to the inputs section of align_sort.cwl:

[...]
  namesource:
    type: string?
[...]

Then modify the steps section of the workflow:

[...]
steps:

  dna_align_and_sort:

    run: align_sort.cwl
    in:
      namesource: subjectName
      reference_fasta: referenceFasta
      fastq_files:
        source: fastqs
        valueFrom: $(self.files)
      sample_name:
        source: fastqs
        valueFrom: $((inputs.namesource)+"_"+(self.sample))
    out:
      [fileInDir]
    scatter: [fastq_files, sample_name]
    scatterMethod: dotproduct
[...]

This should work from a technical perspective and achieve what you asked for in the opening post. But since subjectName is only a single string and you said the sample field is always just "tumor" or "normal", I doubt doing this will solve the problem. From my understanding, you need individual names for each sample.

How are the fastq files named? Maybe you could also add something like +"_"+(self.files[0].nameroot) to the end of the sample_name expression to distinguish them (self.files is an array, so nameroot has to be read from one of its elements).
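
A rough sketch of that idea in plain JavaScript, under a few assumptions: the file names and the subject01 value are made up, the records are shaped like the fastqs workflow input, and nameroot is taken from the first file of each record because files is an array of File objects rather than a single File.

```javascript
// Illustrative records shaped like the `fastqs` workflow input.
// CWL runners set `nameroot` on File objects (basename without its last
// extension); the file names here are invented for the example.
var fastqs = [
  { sample: "normal", files: [{ basename: "run1_N.fastq", nameroot: "run1_N" }] },
  { sample: "normal", files: [{ basename: "run2_N.fastq", nameroot: "run2_N" }] }
];

// Body of a possible valueFrom expression, written as a plain function.
function sampleName(inputs, self) {
  return inputs.namesource + "_" + self.sample + "_" + self.files[0].nameroot;
}

var names = fastqs.map(function (rec) {
  return sampleName({ namesource: "subject01" }, rec);
});

// An array has no `nameroot` property of its own.
var onArray = fastqs[0].files.nameroot;

console.log(names);   // subject01_normal_run1_N and subject01_normal_run2_N
console.log(onArray); // undefined
```

Whether the first file's nameroot is a good discriminator depends on how the fastq files are actually named, which the thread does not say.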

modified 7 months ago • written 7 months ago by Tom520

Thanks, this answered my question. It was hard to shift between the contexts in my head.

written 7 months ago by alanh80
7 months ago by
alanh80
alanh80 wrote:

To specifically put the answer in context, the correct method is to do the following:

steps:
  dna_align_and_sort:
    run: align_sort.cwl
    in:
      reference_fasta: referenceFasta
      fastq_files:
        source: fastqs
        valueFrom: $(self.files)
      subject_name:  subjectName  # THIS brings the subjectName value into local context
      sample_name:
        source: fastqs
        # inputs.subject_name below refers to the "# THIS ..." line above
        valueFrom: "$(inputs.subject_name)_$(self.sample)"
    out: 
      [fileInDir]
    scatter: [fastq_files, sample_name]
    scatterMethod: dotproduct
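
For readers who want to see what the step computes, here is a toy JavaScript model of the dotproduct scatter. It is a sketch, not runner code: runScatteredStep and the sample values are hypothetical, the step input is assumed to be called subject_name as on the "# THIS" line, and the record field is assumed to be sample as declared in the workflow inputs.

```javascript
// Toy model of `scatter: [fastq_files, sample_name]` with
// `scatterMethod: dotproduct`. `runScatteredStep` is a hypothetical name.
function runScatteredStep(workflowInputs) {
  var fastqFilesSource = workflowInputs.fastqs; // source of fastq_files
  var sampleNameSource = workflowInputs.fastqs; // source of sample_name
  var jobs = [];
  // dotproduct: element i of every scattered input goes into job i.
  for (var i = 0; i < fastqFilesSource.length; i++) {
    // Non-scattered step inputs are the same in every job.
    var inputs = { subject_name: workflowInputs.subjectName };
    jobs.push({
      // valueFrom: $(self.files)
      fastq_files: fastqFilesSource[i].files,
      // valueFrom: "$(inputs.subject_name)_$(self.sample)"
      sample_name: inputs.subject_name + "_" + sampleNameSource[i].sample
    });
  }
  return jobs;
}

var jobs = runScatteredStep({
  subjectName: "subject01",
  fastqs: [
    { sample: "tumor", files: ["tumor_R1.fastq", "tumor_R2.fastq"] },
    { sample: "normal", files: ["normal_R1.fastq", "normal_R2.fastq"] }
  ]
});

// Prints subject01_tumor and subject01_normal.
console.log(jobs.map(function (job) { return job.sample_name; }));
```

Because both scattered inputs draw from the same fastqs array, each job sees the files and the sample name of the same record.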
modified 7 months ago • written 7 months ago by alanh80