Question: Matching files by chromosome during wdl scatter gather
Asked 7 months ago by Vivek (Denmark)

Hi,

I'm trying to implement a polygenic score pipeline in WDL, and I'm quite new to this kind of workflow management.

The first step of the pipeline takes a file of GWAS summary statistics and splits it by chromosome.

The corresponding WDL task:

task split {
    input {
        File gwas
        String output_prefix
    }

    command {
        ./splitGwas -i ${gwas} -o ${output_prefix}
    }

    output {
        Array[File] gwas_by_chr = glob("${output_prefix}_*.assoc")
    }
}

The next part of the process is to compute posterior effects of SNPs by chromosome. The inputs to this task are going to be the files split by chromosome in the previous step and an LD matrix for the chromosome.

The corresponding WDL call would be something like this:

Array[File] ld_matrices = read_lines(file_of_ld_matrices_by_chr)

scatter(pair in zip(split.gwas_by_chr, ld_matrices)) {
    call sbayes.run {
        input:
            GCTB = gctb_executable_path,
            gwas = pair.left,
            ld_matrix = pair.right,
            output_prefix = out
    }
}

Since I'm collecting the split files with glob in the previous step, I don't know which element of the array corresponds to which chromosome. zip pairs elements purely by position, so if the two arrays are not in the same chromosome order, I would end up combining the summary statistics for one chromosome with the LD matrix of a different chromosome.
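One way to avoid relying on array order, as a minimal sketch only: derive the chromosome label from each split file's name and use it to look up the matching LD matrix. This assumes that splitGwas names its outputs `<output_prefix>_<chrom>.assoc` (so that whatever follows the last underscore is the chromosome label) and that the LD matrices are listed in a two-column, tab-separated file mapping chromosome label to path. The workflow name `match_by_chromosome` and the input `ld_matrix_table` are hypothetical and not part of the pipeline above.

```wdl
version 1.0

workflow match_by_chromosome {
    input {
        # Stands in for split.gwas_by_chr.
        Array[File] gwas_by_chr
        # Hypothetical TSV with one line per chromosome: label, then LD matrix path.
        File ld_matrix_table
    }

    # read_map yields Map[String, String]; values are coerced to File on lookup.
    Map[String, String] ld_by_chr = read_map(ld_matrix_table)

    scatter (gwas_file in gwas_by_chr) {
        # Assumed naming "<prefix>_<chrom>.assoc": drop the extension, then
        # everything up to and including the last underscore.
        String chrom = sub(basename(gwas_file, ".assoc"), "^.*_", "")
        File ld_matrix = ld_by_chr[chrom]
        Pair[File, File] matched = (gwas_file, ld_matrix)
    }

    output {
        Array[Pair[File, File]] matched_pairs = matched
    }
}
```

In the real workflow, the call to sbayes.run could sit inside this scatter with `gwas = gwas_file` and `ld_matrix = ld_matrix`, replacing the zip. The labels in the table would have to match the labels in the file names exactly; a label missing from the table should make the lookup fail rather than pair the wrong files.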

The programmatic way would be to scatter over the chromosome names instead, but then I would seemingly lose the ability to express the dependency between the two tasks.
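For what it's worth, a scatter over chromosome names does not have to drop the dependency: WDL infers execution order from data references, so any expression that reads `split.gwas_by_chr` waits for the split task. The following is only a sketch and assumes WDL 1.1, because `as_map` is not available in 1.0. It also assumes the file naming `<output_prefix>_<chrom>.assoc`; the inputs `chromosomes` and `ld_by_chr` and the workflow name are hypothetical.

```wdl
version 1.1

workflow scatter_over_chromosomes {
    input {
        # Hypothetical list of labels, e.g. ["1", "2", "22"].
        Array[String] chromosomes
        # Stands in for split.gwas_by_chr; referencing it creates the dependency.
        Array[File] gwas_by_chr
        # Hypothetical map from chromosome label to LD matrix.
        Map[String, File] ld_by_chr
    }

    # Label every split file with the chromosome taken from its name.
    scatter (gwas_file in gwas_by_chr) {
        Pair[String, File] labelled =
            (sub(basename(gwas_file, ".assoc"), "^.*_", ""), gwas_file)
    }
    Map[String, File] gwas_map = as_map(labelled)

    # Iterate over chromosome names; both files are looked up by the same key.
    scatter (chrom in chromosomes) {
        Pair[File, File] matched = (gwas_map[chrom], ld_by_chr[chrom])
    }

    output {
        Array[Pair[File, File]] matched_pairs = matched
    }
}
```

Because both lookups use the same key, a chromosome missing from either map should surface as an error instead of a silent mismatch. Whether a given engine supports version 1.1 needs to be checked before relying on this.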

Is there a better way to do this?

Cheers!

Tags: wdl, scatter