Question: Combine subject with sample type in a scatter?
alanh50 wrote (16 days ago):

I have CWL that runs a pair of tumor-normal samples for a given subject.

For later variant calling, I want to set the read names to something like $(subjectName)_$(fastqs.sample)

The inputs are like this:

inputs:
  fastqs:
    type:
      type: array
      items:
        type: record
        fields:
          - name: sample
            type: string
          - name: files
            type:
              type: array
              items: File
  referenceFasta:
    type: File
  subjectName:
    type: string

steps:

  dna_align_and_sort:

    run: align_sort.cwl
    in:
      reference_fasta: referenceFasta
      fastq_files:
        source: fastqs
        valueFrom: $(self.files)
      sample_name:
        source: fastqs
        valueFrom: $(MAGIC LINE HERE)  # <---- WHAT GOES HERE?
    out:
      [fileInDir]
    scatter: [fastq_files, sample_name]
    scatterMethod: dotproduct

Can someone tell me what should go into the valueFrom of sample_name to make this work?

I have successfully inserted the fastqs.sample as the name using $(self.sample) so I know the underlying code works.

cwl • modified 7 days ago • written 16 days ago by alanh50

Can you elaborate on the kind of problem your example causes? Is it related to the scattering or the referencing of subjectName in the context of the step?

written 15 days ago by Tom240

In a later step, the Mutect2 somatic variant caller seems to name the FORMAT data column in its output VCF using the name attached to the reads. In my example above, "$(self.sample)" is either "tumor" or "normal" depending on the sample type, so the reads get named "tumor" or "normal".

The problem occurs after that, when I try to build a panel of normals (PON) from the normal samples: if they are all named the same thing (normal, normal, normal, etc.), the PON creation step barfs because the names are not unique.

written 15 days ago by alanh50

Can adding the subjectName solve this? Only a single subject name is given to the workflow. Wouldn't they still all have the same (albeit longer) name?

Do the fastq files have unique names? If so, you could add their nameroot to the sample names to distinguish between them.

modified 14 days ago • written 14 days ago by Tom240

I've tried a bunch of iterations here:

valueFrom: "$(subjectName)_$(self.sample)"
valueFrom: ${return inputs.subjectName.concat("_",fastqs.sample)}

and they all fail with various issues.

written 14 days ago by alanh50

Try adding "subjectName" to the step's inputs (the in: section) and then referring to it this way:

valueFrom: "$(inputs.subjectName)_$(self.sample)"
written 9 days ago by peter.amstutz280

Tom240 (Bielefeld University, CeBiTec, Germany) wrote (10 days ago):

Regarding the issues with valueFrom: ${return inputs.subjectName.concat("_",fastqs.sample)} and similar constructs:

In this context, inputs does not reference the workflow's inputs but the inputs of the step. Reading the valueFrom section of the specification leads me to believe there is no way to reference the workflow's subjectName input directly in the StepInputExpression. You would have to pass subjectName to your align_sort.cwl tool and concatenate the names there.
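
The scoping rule described above can be illustrated with a small plain-JavaScript sketch. It is only a model of the behaviour, not how any CWL runner is implemented: evaluateValueFrom is a hypothetical helper, and the values "subjectA" and "tumor" are made up. The point is that the expression sees only the step's own in: entries through inputs, so subjectName is visible only once the step declares it.

```javascript
// Hypothetical model of how a step-level valueFrom expression sees its context.
// `inputs` holds only the entries declared under the step's `in:` section;
// `self` holds the value delivered by `source` (here: one scattered record).
function evaluateValueFrom(expression, stepInputs, self) {
  return expression(stepInputs, self);
}

// The expression under discussion: subject name + "_" + sample label.
const sampleNameExpr = (inputs, self) => inputs.subjectName + "_" + self.sample;

const record = { sample: "tumor", files: [] }; // made-up scattered element

// Case 1: the step does not list subjectName under `in:`.
const withoutSubject = evaluateValueFrom(sampleNameExpr, {}, record);

// Case 2: the step lists `subjectName: subjectName` under `in:`.
const withSubject = evaluateValueFrom(
  sampleNameExpr,
  { subjectName: "subjectA" },
  record
);

console.log(withoutSubject); // "undefined_tumor"
console.log(withSubject);    // "subjectA_tumor"
```

A real CWL runner may reject the undeclared reference with an error rather than silently producing "undefined_tumor"; the sketch only shows that the value is not in scope.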

Another option would be this horrid workaround I use:

Add an input parameter to your align_sort.cwl:

[...]   
  inputs:
      namesource:
        type: string? #This might also be File etc.
[...]

You don't give it an input binding, so the tool itself will never use it. It's optional, so everything will run fine if you don't provide it to the tool. But you can pass subjectName to the workflow step as an input parameter and reference it in the JavaScript expression:

[...]
steps:
  dna_align_and_sort:
    run: align_sort.cwl
    in:
      namesource: subjectName
      sampleName: 
        valueFrom: $((inputs.namesource)+ (whatever.you.like))
[...]
modified 8 days ago • written 10 days ago by Tom240

How about this:

  sampleName: 
    valueFrom: |
      ${
        if (inputs.subjectName) {
          return inputs.subjectName + "_" + inputs.sampleName
        } else {
          return inputs.sampleName
        }
      }
modified 10 days ago • written 10 days ago by alanh50

As I mentioned, I don't think it is possible to reference inputs.subjectName unless subjectName is an input of the WorkflowStep. Also, why reference inputs.sampleName inside the expression that is supposed to yield inputs.sampleName? I feel like I'm fundamentally misunderstanding what is to be accomplished here.

written 8 days ago by Tom240

So, trying to provide a more precise solution. You can get a combination of the workflow input subjectName and the sample field of the fastqs array (as demanded in the opening post) by doing the following: Add this to the inputs section of align_sort.cwl:

[...]
  namesource:
    type: string?
[...]

Then modify the steps section of the workflow:

[...]
steps:

  dna_align_and_sort:

    run: align_sort.cwl
    in:
      namesource: subjectName
      reference_fasta: referenceFasta
      fastq_files:
        source: fastqs
        valueFrom: $(self.files)
      sample_name:
        source: fastqs
        valueFrom: $((inputs.namesource)+"_"+(self.sample))
    out:
      [fileInDir]
    scatter: [fastq_files, sample_name]
    scatterMethod: dotproduct
[...]

This should work from a technical perspective and achieve what you asked for in the opening post. But since subjectName is only a single string, and you said the sample field is always just "tumor" or "normal", I doubt this will solve the problem. From my understanding, you need an individual name for each sample.

How are the fastq files named? Maybe you could also add something like +"_"+(self.files[0].nameroot) to the end of the sample_name string to distinguish them (files is an array, so pick one element, e.g. the first).
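
A small plain-JavaScript sketch of that naming idea, under stated assumptions: the file names and the subject name "subjectA" are invented, buildSampleName is a hypothetical helper, and nameroot is written out by hand (in CWL the runner supplies it on each File object). Because files is an array, the sketch takes its first element.

```javascript
// Made-up scattered records; in CWL each entry of `files` would be a File
// object whose `nameroot` is filled in by the runner.
const fastqs = [
  { sample: "normal", files: [{ basename: "N1_R1.fastq", nameroot: "N1_R1" }] },
  { sample: "tumor", files: [{ basename: "T1_R1.fastq", nameroot: "T1_R1" }] },
];

// Mirrors the expression: namesource + "_" + sample + "_" + first file's nameroot.
function buildSampleName(namesource, self) {
  return namesource + "_" + self.sample + "_" + self.files[0].nameroot;
}

const names = fastqs.map((record) => buildSampleName("subjectA", record));
console.log(names); // [ 'subjectA_normal_N1_R1', 'subjectA_tumor_T1_R1' ]
```

Whether the file name root is a good discriminator depends on how the fastq files are actually named, which the thread does not say.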

modified 8 days ago • written 8 days ago by Tom240

Thanks, this answered my question. It was hard to shift the contexts in my head.

written 7 days ago by alanh50

alanh50 wrote (7 days ago):

To specifically put the answer in context, the correct method is to do the following:

steps:
  dna_align_and_sort:
    run: align_sort.cwl
    in:
      reference_fasta: referenceFasta
      fastq_files:
        source: fastqs
        valueFrom: $(self.files)
      subject_name: subjectName  # THIS brings the subjectName value into the step's local context
      sample_name:
        source: fastqs
        # inputs.subject_name below refers to the "# THIS ..." line above;
        # self.sample is the "sample" field of the scattered fastqs record
        valueFrom: "$(inputs.subject_name)_$(self.sample)"
    out: 
      [fileInDir]
    scatter: [fastq_files, sample_name]
    scatterMethod: dotproduct
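
To see what this step would hand to align_sort.cwl, here is a plain-JavaScript sketch that models the dotproduct scatter. It is an illustration only: scatterJobs is a hypothetical helper, the subject and file names are invented, and files are shown as plain strings rather than File objects. It uses the step input name subject_name and the record field sample, and produces one job per record in fastqs.

```javascript
// Made-up workflow inputs, shaped like the `inputs:` section of the question.
const workflowInputs = {
  subjectName: "subjectA",
  fastqs: [
    { sample: "normal", files: ["normal_R1.fastq", "normal_R2.fastq"] },
    { sample: "tumor", files: ["tumor_R1.fastq", "tumor_R2.fastq"] },
  ],
};

// Model of the scattered step: fastq_files and sample_name share the same
// source, so dotproduct pairs element i of one with element i of the other.
function scatterJobs(wf) {
  return wf.fastqs.map((self) => {
    const inputs = { subject_name: wf.subjectName }; // step-level `in:` entry
    return {
      fastq_files: self.files, // models $(self.files)
      sample_name: inputs.subject_name + "_" + self.sample, // models the valueFrom
    };
  });
}

const jobs = scatterJobs(workflowInputs);
console.log(jobs.map((job) => job.sample_name)); // [ 'subjectA_normal', 'subjectA_tumor' ]
```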
modified 7 days ago • written 7 days ago by alanh50