After reading a review of bioinformatics pipeline frameworks, I've started testing a few to see if I could create a simple ChIP-seq peak-calling pipeline. However, I always seem to run into the same problem: what format should I put my metadata in, and how can I specify sample comparisons (for example, ChIP-factor against naked DNA) so that the pipeline stays flexible to different numbers of samples and comparisons?
With Snakemake you can provide a .yaml or .json file to describe the experiment and the comparisons, but in a large study writing this by hand is quite tedious (I wonder if there is a nice way to parse a .tsv sample design file into a .yaml or .json configuration?). Alternatively, if I were to use GNU Make, all of the comparisons would either have to be specified manually in the Makefile, or a single rule would call a script that runs the comparisons by parsing a sample design file. In the latter case I can't take advantage of Make's parallel jobs feature and would have to rely on the parallel processing modules available in the scripting language.
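To make the .tsv idea concrete, here is a minimal sketch of the kind of conversion I have in mind. The column names (`sample`, `fastq`, `control`), the file paths, and the function name `design_to_config` are all made up for illustration; they are not part of Snakemake or any other framework. It only uses the Python standard library and writes JSON rather than YAML to avoid an extra dependency.

```python
import csv
import io
import json


def design_to_config(tsv_text):
    """Turn a tab-separated sample design into a config dictionary."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # Map each sample name to its FASTQ path.
    samples = {row["sample"]: row["fastq"] for row in rows}
    # A row with a non-empty 'control' column defines one comparison.
    comparisons = [
        {"treatment": row["sample"], "control": row["control"]}
        for row in rows
        if row["control"]
    ]
    return {"samples": samples, "comparisons": comparisons}


# Hypothetical design: two ChIP samples compared against one input sample.
EXAMPLE_DESIGN = (
    "sample\tfastq\tcontrol\n"
    "input1\tdata/input1.fastq.gz\t\n"
    "chipA\tdata/chipA.fastq.gz\tinput1\n"
    "chipB\tdata/chipB.fastq.gz\tinput1\n"
)

if __name__ == "__main__":
    print(json.dumps(design_to_config(EXAMPLE_DESIGN), indent=2))
```

Presumably the resulting JSON could be saved and handed to Snakemake as its config file, but I don't know whether this is the idiomatic approach or whether there is a cleaner built-in way.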
I feel that whatever pipeline I wrote with these frameworks would continuously have to be changed or updated to accommodate new numbers of samples or comparisons, when really all I want is to provide a new sample design file and have the pipeline adapt to the new design without too much extra tweaking. Maybe I'm expecting too much, but is there a particular representation of metadata that you've found works well, or could you recommend a different framework that handles metadata and comparisons directly?