Matching files by chromosome during WDL scatter-gather
19 months ago
Vivek ★ 2.5k

Hi,

I'm trying to implement a polygenic score pipeline in WDL, and I'm quite new to this kind of pipeline management.

The first step of the pipeline takes a file of GWAS summary statistics and splits it by chromosome.

task split {
  input {
    File gwas
    String output_prefix
  }

  command {
    ./splitGwas -i ${gwas} -o ${output_prefix}
  }

  output {
    # Collect one output file per chromosome
    Array[File] gwas_by_chr = glob("${output_prefix}_*.assoc")
  }
}


The next part of the process is to compute posterior effects of SNPs by chromosome. The inputs to this task are going to be the files split by chromosome in the previous step and an LD matrix for the chromosome.

The corresponding WDL call would be something like this:

Array[File] ld_matrices = read_lines(file_of_ld_matrices_by_chr)

scatter (pair in zip(split.gwas_by_chr, ld_matrices)) {
  call sbayes.run {
    input:
      GCTB = gctb_executable_path,
      gwas = pair.left,
      ld_matrix = pair.right,
      output_prefix = out
  }
}


Since I'm collecting the split files with glob in the previous step, I don't know which element of the array corresponds to which chromosome. Zipping the two arrays could therefore pair the summary statistics for one chromosome with the LD matrix for a different chromosome.

The programmatic way would be to scatter over chromosome names, but then I would lose the ability to specify the dependency between the two tasks.
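To make that idea concrete, here is a rough, untested sketch of a chromosome-driven variant. It assumes that splitGwas names its outputs <output_prefix>_<chromosome>.assoc (my reading of the glob pattern) and that the lines in file_of_ld_matrices_by_chr follow the same order as the chromosome list. The chromosomes input, the ordered_files.txt manifest, the workflow name and the run_sbayes placeholder task (standing in for sbayes.run) are hypothetical names introduced only for illustration.

```wdl
version 1.0

workflow prs_by_chromosome {
  input {
    File gwas
    String out
    Array[String] chromosomes          # hypothetical input, e.g. ["1", "2", "3"]
    File file_of_ld_matrices_by_chr    # assumed: one path per line, same order as `chromosomes`
  }

  Array[File] ld_matrices = read_lines(file_of_ld_matrices_by_chr)

  call split {
    input:
      gwas = gwas,
      output_prefix = out,
      chromosomes = chromosomes
  }

  # Scatter over positions in the chromosome list. Each call still reads
  # split.gwas_by_chr, so it cannot start before split has finished.
  scatter (i in range(length(chromosomes))) {
    call run_sbayes {
      input:
        gwas = split.gwas_by_chr[i],
        ld_matrix = ld_matrices[i],
        output_prefix = out + "_chr" + chromosomes[i]
    }
  }
}

task split {
  input {
    File gwas
    String output_prefix
    Array[String] chromosomes
  }

  command <<<
    ./splitGwas -i ~{gwas} -o ~{output_prefix}
    # Assumption: outputs are named <output_prefix>_<chromosome>.assoc.
    # List them in the order of `chromosomes` instead of relying on glob order.
    for chr in ~{sep=" " chromosomes}; do
      echo "~{output_prefix}_${chr}.assoc"
    done > ordered_files.txt
  >>>

  output {
    # Element i corresponds to chromosomes[i]
    Array[File] gwas_by_chr = read_lines("ordered_files.txt")
  }
}

# Placeholder for the real sbayes.run task; the command is a stand-in only.
task run_sbayes {
  input {
    File gwas
    File ld_matrix
    String output_prefix
  }

  command <<<
    echo "would process ~{gwas} with ~{ld_matrix}" > ~{output_prefix}.log
  >>>

  output {
    File log = "~{output_prefix}.log"
  }
}
```

As far as I can tell, the scatter body referencing split.gwas_by_chr[i] is what keeps the dependency on split, but the pairing is only as reliable as the assumed file naming and the ordering of the LD matrix list.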

Is there a better way to do this?

Cheers!

wdl scatter • 957 views