Tool: Luigi's Monkey Wrench - A small helper library for command-line-heavy bioinformatics workflows in Spotify's Luigi workflow tool
written 5.7 years ago by Samuel Lampa:

Github repository:

From the Github README:

Luigi's Monkey Wrench is a small library (50 LOC exactly, as of Feb 12) that intends to make writing Luigi workflows that use a lot of shell commands (which is common e.g. in bioinformatics) a tad easier. It lets you define workflow tasks with a simple shell command pattern, and dependencies with a simple single-assignment pattern that specifies how task inputs depend on each other's outputs, like so:

import luigi
from luigis_monkey_wrench import *

class MyWorkFlow(WorkflowTask):
    def requires(self):
        # Create some tasks
        hejer = shell('echo hej > <o:hejfile:hej.txt>')
        fooer = shell('cat <i:hejfile> | sed "s/hej/foo/g" > <o:foofile:<i:hejfile:.txt|.foo>>')

        # Connect them together
        fooer.inports['hejfile'] = hejer.outport('hejfile')

        # Return the last one in the chain
        return fooer

# Make this a runnable script, and leave control to luigi
if __name__ == '__main__':
    luigi.run()

Short and neat, ain't it?

But let's go through this example in a bit more detail, to see what we are really doing:

import luigi
from luigis_monkey_wrench import *

# Yes, we write the workflow definition inside a normal luigi task ...
class MyWorkFlow(WorkflowTask):
    # ... and do this by setting up the dependency graph and (letting the workflow
    # task depend on it, by) returning the last task in the dependency graph in the
    # workflow task's requires() function:
    def requires(self):
        # Create tasks by initializing ShellTasks, giving
        # the shell command to execute as the cmd parameter.
        # File names are given in this special form (including <>):
        #   <i:INPUT_NAME>
        # Output file names can also include the filename of an input:
        #   <o:some_output:<i:some_input>.some_extension>
        # One can also replace the extension, or ending, of the input
        # filename in the output file name, using the following syntax.
        # E.g., to create <filename>.csv as output from <filename>.txt, we do:
        #   <o:some_output:<i:some_input:.txt|.csv>>
        hejer = shell('echo hej > <o:hejfile:hej.txt>')
        fooer = shell('cat <i:hejfile> | sed "s/hej/foo/g" > <o:foofile:<i:hejfile:.txt|.foo>>')

        # Define the workflow "dependency graph" by telling how outputs
        # from tasks are re-used in inputs of other tasks
        fooer.inports['hejfile'] = hejer.outport('hejfile')

        # Return the last task in the workflow
        return fooer

# We finally make this file into an executable python file, and let luigi take care
# of the running, which will, among many other cool things, mean that we get a nice
# command line interface generated for us:
if __name__ == '__main__':
    luigi.run()

Now run this (as usual with luigi tasks) like this, replacing your_script.py with your workflow file's name:

python your_script.py MyWorkFlow --local-scheduler
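To make the <i:...>/<o:...> placeholder syntax above a bit more concrete, here is a minimal sketch of how such substitution could work. This is my own illustration, not the library's actual parsing code; the function name `resolve_command` and its return shape are assumptions for the example:

```python
import re

def resolve_command(cmd, inputs):
    """Substitute <i:...> and <o:...> placeholders in a shell command.

    `inputs` maps input port names to concrete file paths. Returns the
    resolved command and a dict mapping output port names to paths.
    Illustrative sketch only; not the library's real implementation.
    """
    # Resolve inputs first, handling extension replacement: <i:name:.old|.new>
    def repl_input(m):
        name, old, new = m.group(1), m.group(2), m.group(3)
        path = inputs[name]
        if old is not None and path.endswith(old):
            path = path[: -len(old)] + new
        return path

    cmd = re.sub(r'<i:(\w+)(?::([^|>]+)\|([^>]+))?>', repl_input, cmd)

    # Then resolve outputs (any nested <i:...> is already a plain path),
    # recording each output port name and its file path
    outputs = {}
    def repl_output(m):
        outputs[m.group(1)] = m.group(2)
        return m.group(2)

    cmd = re.sub(r'<o:(\w+):([^<>]+)>', repl_output, cmd)
    return cmd, outputs
```

Running this on the two commands from the example turns `echo hej > <o:hejfile:hej.txt>` into `echo hej > hej.txt` (recording the output port `hejfile`), and resolves the nested `<o:foofile:<i:hejfile:.txt|.foo>>` to `hej.foo` once the input path is known.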

Quick start

Install the dependencies, luigi (and optionally tornado):

pip install luigi
pip install tornado

Clone this git repo to somewhere:

mkdir testlmw
cd testlmw
git clone <repository-url>

Run the example script (or one that you have already):

python your_script.py MyWorkFlow --local-scheduler

Current Status: Experimental

Use at your own risk!

A "Real-World" NGS Bioinformatics code example:


What would be the advantage of using this over, say, a GNU makefile-based graph or pipeline?

written 5.7 years ago by Alex Reynolds

A few things we noted:

  1. It uses Luigi, which we are already using :) 
  2. Luigi has Hadoop support built-in (so shell based tasks are interoperable with Hadoop tasks)
  3. Make and make-inspired tools like snakemake (python) work, in our experience, rather backwards, in that you typically need to ask for the most downstream target by specifying the pattern for that target's file names. This can sometimes be tricky with long dependency graphs that change frequently. In our approach you don't need to think about the final file names; they "fall out" automatically from the additions / replacements you make to the filename in each step.
  4. Again, with dependency graphs that change frequently, when you want to add a target in the middle of a long series of sequential processing steps, it seems to us that you typically have to modify the target pattern of each downstream step (I would be interested to see if it is possible to get around that, though). Dynamically generated file extensions make this a no-brainer.
  5. Time-stamp based difference detection does not seem like the correct pattern for what we do. In luigi the approach is instead to encode each unique parameter setting in a unique file name pattern (e.g. a new raw dataset version would "tag" all downstream files with a certain pattern). We are working on including that thinking in the approach taken in the tool above.
  6. Time-stamp based checking could also cause problems when working across multiple disparate file systems at an HPC center (not all systems implement POSIX handling 100% equally).
  7. Various niceties of luigi, such as the graphical visualizer (see the luigi github README for an example), and the flexibility of luigi being a (python) library rather than a rigid tool, letting you mold it according to your needs ...
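The filename-tagging idea from point 5 — encoding each parameter setting in the file name, so a changed parameter yields a new target rather than relying on timestamp comparison — can be sketched without Luigi at all. The helper name `tagged_output` below is hypothetical, not part of Luigi or Luigi's Monkey Wrench:

```python
def tagged_output(base, ext, **params):
    """Build an output file name that encodes each parameter setting.

    A changed parameter produces a different file name, so the workflow
    engine sees a missing target and re-runs the step, instead of
    comparing timestamps. Hypothetical helper for illustration only.
    """
    # Sort parameters so the same settings always yield the same name
    tags = '.'.join(f'{k}_{v}' for k, v in sorted(params.items()))
    return f'{base}.{tags}{ext}' if tags else f'{base}{ext}'
```

For example, `tagged_output('sample1', '.bam', aligner='bwa', dataset='v2')` yields `sample1.aligner_bwa.dataset_v2.bam`; bumping the dataset version to `v3` would tag all downstream files with the new pattern and trigger their regeneration.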
written 5.7 years ago by Samuel Lampa

Btw, I posted a little bit about the background to creating this helper lib (furiously trying to jot down NGS bioinformatics tasks in any workflow tool that would do the job ... tried snakemake and vanilla luigi to no avail) in the luigi user group:

written 5.7 years ago by Samuel Lampa
Powered by Biostar version 2.3.0