Question

Standard simple format to describe a bioinformatics analysis pipeline

1

9.4 years ago

Laura ★ 1.8k

I want to get a collection of different groups to describe their analysis pipelines in a standard way, so that it is easier to see where people are doing the same thing and where they are doing different things for the same sort of analysis.

I think the attributes this file would need for each step in a pipeline are:

inputs, outputs, program, version, command line.

It would be good to have something which also states the order of the steps.

I know the SRA/ENA analysis XML allows for at least some of this, but it is quite heavyweight, so I am hoping for either a custom format or something using JSON syntax, so that it is both human-readable and open to some programmatic parsing.

Before I specify something myself: is there already a solution which provides most, if not all, of the functionality I want?

ChIP-Seq alignment RNA-Seq • 3.5k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by Laura ★ 1.8k

0

A well-written Makefile should be (mostly) self-documenting to a bioinformaticist.

ADD REPLY • link 9.4 years ago by Alex Reynolds 35k

0

Can you point to an example of a Makefile that you think is good? I think of them as a tool for building software, not for running analyses.

ADD REPLY • link 9.4 years ago by Laura ★ 1.8k

0

I've just written a simple one: https://github.com/lindenb/ngsxml

ADD REPLY • link 9.4 years ago by Pierre Lindenbaum 161k

score 3 · Answer 1 · 2014-12-03

3

9.4 years ago

Pierre Lindenbaum 161k

Hmm... I'm not sure I understand. Something like a Makefile contains all the recipes and the command lines, but it can be hard to read. One could imagine building a Makefile-based workflow using an XML file plus an XSLT stylesheet. See my **old** example: http://plindenbaum.blogspot.fr/2012/08/the-500th-post-generating-pipeline-of.html

The very same XML descriptor could be used to write LaTeX/Markdown documentation about the workflow.

UPDATE: I wrote this https://github.com/lindenb/ngsxml

Second idea: a Galaxy pipeline can be exported to a file; see https://wiki.galaxyproject.org/ToolShedWorkflowSharing. As far as I remember, the output format is JSON.

ADD COMMENT • link 9.4 years ago by Pierre Lindenbaum 161k

0

This isn't necessarily meant to be something someone could use to run the pipeline, but rather a description of the command lines used. That way, someone who wanted to install all the tools and rerun the process with their own pipeline infrastructure could do so, or you could simply compare how the same steps differ between pipelines. I hope something like this will help me improve the README files for our analyses too.

I tried to find an example of the Galaxy JSON you mention, but I couldn't come up with the right search terms. Do you have an example?

ADD REPLY • link 9.4 years ago by Laura ★ 1.8k

0

Moved my comment to an answer.

ADD REPLY • link 9.4 years ago by Pierre Lindenbaum 161k

Ram · Answer 2 · 2014-12-03

Here is a good overview of using makefiles for bioinformatics analyses. A makefile is a way to write out a DAG of targets and dependencies.

Writing a DAG in JSON is certainly doable (see the convoluted example below), but since I could not find anything prepackaged, it has seemed to me a lot more work. I had to design my structure of ordered targets, dependencies and parameters as a nested set of JSON objects and lists, and I had to write all the external code to validate the graph structure and its components, as well as to process dependencies into target end products.

JSON is more readable, but it is also more verbose ("heavyweight") as a consequence. Making sure all the bits and pieces are in the pipeline document seems a strong prerequisite. If you want to go this route, you might consider looking into JSON Schema to design a "meta"-language or schema for your graph, which can be used to help ensure individual instances of a pipeline are correct before processing. You might write a schema, and then write a JSON-formatted pipeline that validates to your schema.

Here is one very rough example of such a schema document, which defines inputs (sets of genomic intervals, essentially), operations applied on those sets to create outputs, and a vocabulary of properties and parameters that might be useful for staging and processing (datetime stamp, ID fields, descriptive metadata, etc.):

The following is an example of a JSON-formatted instance of a processing pipeline, which would validate against this schema. The goal is to show a graph that would take transcription start sites, filter them for belonging to the CTCF factor, and then apply the equivalent of a bedmap operation against them and a list of promoter windows of interest on chromosome 16:

It would be the job of whatever service parses this JSON request or payload to decide which genomic sets are inputs that exist (dependencies) and which are targets, yet to be made, which require backend processing steps.
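As a rough illustration of that last point, here is a minimal Python sketch of how a service might separate existing inputs from targets and put the steps in order. The step layout (`name`, `inputs`, `outputs`, `command` keys) and the file names are assumptions made for this example, not the structure of the schema described above. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline description; keys and file names are illustrative only.
# The steps are deliberately listed out of order.
pipeline = [
    {"name": "map", "inputs": ["promoters.bed", "ctcf_tss.bed"],
     "outputs": ["answer.bed"],
     "command": "bedmap --chrom chr16 promoters.bed ctcf_tss.bed > answer.bed"},
    {"name": "filter", "inputs": ["TSS.bed"], "outputs": ["ctcf_tss.bed"],
     "command": "grep 'CTCF' TSS.bed > ctcf_tss.bed"},
]

def classify_files(steps):
    """Split files into sources (never produced by a step) and targets."""
    produced = {f for s in steps for f in s["outputs"]}
    consumed = {f for s in steps for f in s["inputs"]}
    return sorted(consumed - produced), sorted(produced)

def run_order(steps):
    """Return step names so that each step follows the steps it depends on."""
    producer = {f: s["name"] for s in steps for f in s["outputs"]}
    graph = {
        s["name"]: {producer[f] for f in s["inputs"] if f in producer}
        for s in steps
    }
    # static_order() raises graphlib.CycleError if the steps form a cycle.
    return list(TopologicalSorter(graph).static_order())

sources, targets = classify_files(pipeline)
order = run_order(pipeline)
print("sources:", sources)
print("targets:", targets)
print("order:", order)
```

Files that no step produces are treated as dependencies that must already exist; everything else is a target. The ordering comes purely from matching one step's outputs to another step's inputs, so the description does not need an explicit step number.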

There are various libraries written to process JSON and validate JSON against a JSON Schema document. In Python, for instance:

$ python
Python 2.7.6 (default, Jul  9 2014, 20:49:24) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> from jsonschema import validate
>>> schema_fh = open("BEDOPSWebRequestSchema.json", "r")
>>> schema = json.load(schema_fh)
>>> test_request_fh = open("SampleWebRequestPayload.json", "r")
>>> test_request = json.load(test_request_fh)
>>> validate(test_request, schema)

If the request doesn't validate, a ValidationError exception is raised, with errors that point to the offending JSON object in the request. If the request does validate, that doesn't rule out problems with the schema itself, but it's a good start for testing and validation.
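To give a feel for what this kind of validation catches without depending on the schema files named above, here is a small standard-library sketch. It is a hand-rolled required-field check, not JSON Schema, and the field names (taken from the attributes listed in the question) and the example values are assumptions.

```python
import json

# Fields each step must carry; the names are an assumption based on the question.
REQUIRED_FIELDS = ("inputs", "outputs", "program", "version", "command_line")

def find_problems(pipeline):
    """Return a list of readable problems; an empty list means the check passed."""
    problems = []
    for index, step in enumerate(pipeline.get("steps", [])):
        for field in REQUIRED_FIELDS:
            if field not in step:
                problems.append(f"steps[{index}]: missing '{field}'")
        for field in ("inputs", "outputs"):
            if field in step and not isinstance(step[field], list):
                problems.append(f"steps[{index}]: '{field}' must be a list")
    return problems

# Hypothetical document; the version string is a placeholder, not a real release.
document = json.loads("""
{
  "steps": [
    {"inputs": ["TSS.bed"], "outputs": ["ctcf_tss.bed"],
     "program": "grep", "version": "x.y", "command_line": "grep 'CTCF' TSS.bed"},
    {"inputs": "promoters.bed", "outputs": ["answer.bed"], "program": "bedmap"}
  ]
}
""")

problems = find_problems(document)
for problem in problems:
    print(problem)
```

The second step is deliberately incomplete, so the check reports its missing fields and the non-list `inputs` value. A real JSON Schema validator covers far more (types, formats, nested structures), which is the reason to prefer it once the format settles.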

Maybe there is a suite of tools written that do all of this already, but I wasn't able to find one. Hopefully someone more knowledgeable will comment, or hopefully this post gives some ideas of what could potentially be done.

GNU Make seems to do a lot of the heavy lifting, and the tools to process a makefile are ubiquitous on UNIX-like systems such as Linux and OS X, so translating this system to another language is perhaps reinventing the wheel. I mean, all that JSON above can basically be reduced to something like:

$ grep 'CTCF' TSS.bed | bedmap --chrom chr16 promoters.bed - > answer.bed

A makefile from this is not very long or complex to read:

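all: ctcf_tss.bed answer.bed

# Recipe lines must start with a tab character, not spaces.
ctcf_tss.bed: TSS.bed
	grep 'CTCF' $< > $@

# Listing both inputs as prerequisites lets make notice when either changes.
answer.bed: promoters.bed ctcf_tss.bed
	bedmap --chrom chr16 $^ > $@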

Further, if any dependency changes (say, the set of transcription start sites changes) then only targets downstream of changed dependencies get remade, which is more efficient. This could be done with a JSON-based approach, but it requires coding.
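To show roughly what "requires coding" means here, the following Python sketch mimics make's timestamp rule: a target is remade if it is missing, older than one of its dependencies, or downstream of something that is being remade. The function names and the `(target, dependencies)` rule layout are assumptions for illustration, and the sketch only decides what is stale; it does not run any commands.

```python
import os

def is_stale(target, dependencies):
    """Return True if target is missing or older than any of its dependencies.

    Raises FileNotFoundError if the target exists but a dependency does not.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)

def stale_targets(rules):
    """List the targets that need remaking.

    `rules` is a list of (target, dependencies) pairs, already ordered so that
    each target appears after the targets it depends on.
    """
    remake = []
    for target, dependencies in rules:
        # A target is also stale if one of its dependencies is about to be remade.
        if is_stale(target, dependencies) or any(d in remake for d in dependencies):
            remake.append(target)
    return remake
```

With rules such as `[("ctcf_tss.bed", ["TSS.bed"]), ("answer.bed", ["promoters.bed", "ctcf_tss.bed"])]`, touching `TSS.bed` marks both targets for remaking, while touching only `promoters.bed` marks just `answer.bed`.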