Matching files by chromosome during WDL scatter-gather
4.2 years ago
Vivek ★ 2.7k


I'm trying to implement a polygenic score pipeline in WDL, and I'm quite new to pipeline management.

The first step of the pipeline takes a file of GWAS summary statistics and splits it by chromosome.

The corresponding WDL task:

task split {
    input {
        File gwas
        String output_prefix
    }

    command {
        ./splitGwas -i ${gwas} -o ${output_prefix}
    }

    output {
        Array[File] gwas_by_chr = glob("${output_prefix}_*.assoc")
    }
}

The next part of the process is to compute posterior effects of SNPs by chromosome. The inputs to this task are going to be the files split by chromosome in the previous step and an LD matrix for the chromosome.

The corresponding WDL call would be something like this:

Array[File] ld_matrices = read_lines(file_of_ld_matrices_by_chr)

scatter (pair in zip(split.gwas_by_chr, ld_matrices)) {
    # compute_posterior_effects is a placeholder task name
    call compute_posterior_effects {
        input:
            GCTB = gctb_executable_path,
            gwas = pair.left,
            ld_matrix = pair.right,
            output_prefix = out
    }
}

Since I'm collecting the split files with glob in the previous step, I don't know which element of the array corresponds to which chromosome. zip pairs elements purely by position, so I could end up pairing the summary statistics for one chromosome with the LD matrix for a different chromosome.

The programmatic way would be to scatter over chromosome names, but then I would lose the ability to specify the dependency between the two tasks.
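To make the idea concrete, here is a minimal, untested sketch of keying the pairing on a chromosome label instead of array position. It rests on several assumptions that are not confirmed above: the split files are named <output_prefix>_<chr>.assoc, the chromosome label itself contains no underscore, and the LD matrices are listed in a two-column, tab-separated file (called ld_matrix_table here, a hypothetical input) mapping chromosome label to matrix path. The task compute_posterior_effects is a placeholder with a dummy command, not the real GCTB invocation.

```wdl
version 1.0

workflow match_by_chromosome {
    input {
        File gwas
        String output_prefix
        File ld_matrix_table  # hypothetical TSV: <chr><TAB><path to LD matrix>
    }

    call split { input: gwas = gwas, output_prefix = output_prefix }

    # Look up LD matrices by chromosome label rather than by array position.
    Map[String, String] ld_by_chr = read_map(ld_matrix_table)

    # Referencing split.gwas_by_chr makes the scatter wait for split to finish.
    scatter (gwas_chr in split.gwas_by_chr) {
        # Drop ".assoc", then everything up to and including the last underscore.
        String chr = sub(basename(gwas_chr, ".assoc"), "^.*_", "")

        call compute_posterior_effects {
            input:
                gwas = gwas_chr,
                ld_matrix = ld_by_chr[chr],
                output_prefix = output_prefix + "_" + chr
        }
    }
}

task split {
    input {
        File gwas
        String output_prefix
    }
    command {
        ./splitGwas -i ${gwas} -o ${output_prefix}
    }
    output {
        Array[File] gwas_by_chr = glob("${output_prefix}_*.assoc")
    }
}

# Placeholder task: the command only records which files were paired.
task compute_posterior_effects {
    input {
        File gwas
        File ld_matrix
        String output_prefix
    }
    command {
        echo "${gwas} ${ld_matrix}" > ${output_prefix}.pairing.txt
    }
    output {
        File pairing = "${output_prefix}.pairing.txt"
    }
}
```

If this works as intended, the scatter still iterates over the output of split, so the dependency between the two tasks is kept, and a chromosome missing from the lookup table should surface as a failed map lookup rather than a silent mismatch.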

Is there a better way to do this?


wdl scatter • 1.8k views
