Question: How to handle metadata in bioinformatics pipelines?
James Ashmore (UK/Edinburgh/MRC Centre for Regenerative Medicine) wrote, 15 months ago:

After reading a review of bioinformatics pipeline frameworks, I've started testing a few to see if I could create a simple ChIP-seq peak-calling pipeline. However, I always seem to run into the same problem: what format should I put my metadata in, and how can I specify sample comparisons (for example ChIP-factor against naked DNA) so that the pipeline adapts to different numbers of samples and comparisons?

With Snakemake you can provide a .yaml or .json file to describe the experiment and the comparisons, but in a large study doing this by hand is quite tedious (I wonder if there is a nice way to parse a .tsv sample design file into a .yaml or .json configuration?). Alternatively, if I were to use GNU Make, all of the comparisons would have to either be specified manually in the Makefile, or a single rule would be used to call a script which runs the comparisons by parsing a sample design file. In the latter case I can't take advantage of Make's parallel jobs feature and would have to use the parallel processing modules available in the scripting language.
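A minimal sketch of the TSV-to-config idea raised above, using only the Python standard library. The column names (`sample`, `fastq`, `control`) and the layout of the resulting dictionary are assumptions made for illustration, not a format that Snakemake or any other framework prescribes. JSON is used for the output because it needs no third-party package.

```python
import csv
import io
import json

# Hypothetical sample design; the column names are assumptions, not a standard.
DESIGN_TSV = (
    "sample\tfastq\tcontrol\n"
    "input1\tdata/input1.fastq.gz\t\n"
    "chipA\tdata/chipA.fastq.gz\tinput1\n"
    "chipB\tdata/chipB.fastq.gz\tinput1\n"
)


def design_to_config(handle):
    """Turn a tab-separated design table into a nested config dictionary."""
    config = {"samples": {}, "comparisons": []}
    for row in csv.DictReader(handle, delimiter="\t"):
        config["samples"][row["sample"]] = row["fastq"]
        # A non-empty control column means "compare this sample to that control".
        if row["control"]:
            config["comparisons"].append(
                {"treatment": row["sample"], "control": row["control"]}
            )
    return config


config = design_to_config(io.StringIO(DESIGN_TSV))
print(json.dumps(config, indent=2))
```

With a real design file, `io.StringIO(DESIGN_TSV)` would be replaced by an open file handle, and the printed JSON could be saved as the configuration file.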

I feel that whatever pipeline I wrote with these frameworks would continuously have to be changed or updated to accommodate new numbers of samples or comparisons, when really all I want is to be able to provide a new sample design file and have the pipeline adapt to the new design without too much extra tweaking. Maybe I'm expecting too much, but is there a particular representation of metadata you've found works well, or could you recommend a different framework which handles metadata and comparisons directly?

pipeline snakemake make metadata • 1.1k views
modified 15 months ago by Jeremy Leipzig • written 15 months ago by James Ashmore

In the case of Snakemake, you can get pretty far with parsing a TSV file and using conditionals in the input/output/params section. I've yet to see a perfect solution for metadata handling, though.
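To make the "conditionals in the input section" idea concrete, here is a rough sketch of a function of the kind Snakemake accepts as a rule's input, written so it also runs outside Snakemake. The `CONTROLS` lookup, the `bam/` path layout, and the function name are hypothetical; `SimpleNamespace` only stands in for the wildcards object that Snakemake would pass.

```python
from types import SimpleNamespace

# Hypothetical lookup, as it might be built from a parsed TSV sample sheet.
CONTROLS = {"chipA": "input1", "chipB": "input1", "input1": None}


def peak_calling_inputs(wildcards):
    """Return the BAM files needed to call peaks for wildcards.sample."""
    files = ["bam/{}.bam".format(wildcards.sample)]
    control = CONTROLS.get(wildcards.sample)
    # Only samples with a control in the sample sheet get a second input.
    if control is not None:
        files.append("bam/{}.bam".format(control))
    return files


# Stand-in for the wildcards object, for running outside a workflow.
print(peak_calling_inputs(SimpleNamespace(sample="chipA")))
```

Inside a Snakefile the function would be referenced by name in a rule's `input:` section, so adding a row to the sample sheet changes the inputs without editing the rule.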

written 15 months ago by Devon Ryan

After a bit more testing I agree that parsing a TSV file is probably the best way to handle metadata until a framework with integrated support comes along. For now I'll use a script I wrote to convert the sample sheet into YAML format, which Snakemake understands.

written 15 months ago by James Ashmore

I'm not sure I understand exactly what you mean by metadata. Could you clarify that a little? Do you mean a description of the outputs you want the workflow to produce?

Part of your description sounds like something I've ranted about before: the need for dynamic workflow scheduling, where the number of tasks (comparisons, in this case) is determined by the output of an earlier task in the workflow. That is typically possible with data-flow based systems, so I would expect it to be possible in Nextflow, and also in my own experimental library SciPipe, where the concrete need for dynamic scheduling was the whole motivation behind writing it.
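As a rough illustration of dynamic scheduling in plain Python (this is not how Nextflow or SciPipe implement it), the sketch below lets the result of an upstream step decide how many downstream tasks are launched. The function names and the placeholder "peak calling" are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def discover_comparisons(design_rows):
    """Upstream task: its output decides how many downstream tasks exist."""
    return [(sample, control) for sample, control in design_rows if control]


def call_peaks(pair):
    """Placeholder downstream task; a real one would run a peak caller."""
    treatment, control = pair
    return "{}_vs_{}.peaks".format(treatment, control)


design = [("input1", ""), ("chipA", "input1"), ("chipB", "input1")]
pairs = discover_comparisons(design)

# One downstream task per discovered pair, run in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    outputs = list(pool.map(call_peaks, pairs))
print(outputs)
```

Nothing in the script fixes the number of comparisons in advance; it follows from `design`, which is the property being asked for.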

modified 15 months ago by Ram • written 15 months ago by Samuel Lampa

In this case, I would interpret metadata to be things like groups and control samples that are sometimes, but not always, used for normalization. The classic example would be ChIP-seq, where different ChIPs will have different controls. In an ideal world these would have some consistent naming scheme, but on really large projects that may not be a reasonable assumption.

written 15 months ago by Devon Ryan

A database with a good API would be a good solution.
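One way to picture this, as a sketch only: an in-memory SQLite database holding samples and comparisons, with a small query function playing the role of the API. The table layout, column names, and `comparisons_for` helper are assumptions for illustration.

```python
import sqlite3

# In-memory database; a real project would use a file or a server.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE samples (name TEXT PRIMARY KEY, fastq TEXT NOT NULL);
    CREATE TABLE comparisons (
        treatment TEXT NOT NULL REFERENCES samples(name),
        control   TEXT NOT NULL REFERENCES samples(name)
    );
    """
)
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [
        ("input1", "data/input1.fastq.gz"),
        ("chipA", "data/chipA.fastq.gz"),
        ("chipB", "data/chipB.fastq.gz"),
    ],
)
conn.executemany(
    "INSERT INTO comparisons VALUES (?, ?)",
    [("chipA", "input1"), ("chipB", "input1")],
)


def comparisons_for(connection, control):
    """Hypothetical API call: treatments compared against one control."""
    rows = connection.execute(
        "SELECT treatment FROM comparisons WHERE control = ? ORDER BY treatment",
        (control,),
    )
    return [row[0] for row in rows]


print(comparisons_for(conn, "input1"))
```

A pipeline could call such a function at start-up instead of parsing a design file, at the cost of maintaining the database.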

written 15 months ago by Jean-Karim Heriche

Hi James,

Sorry to be late to this discussion, but I couldn't find any other way to contact you.

I am not quite clear on one thing: by "number of samples", do you mean the number of input files to be merged for one sample? Because for a ChIP-factor vs. input comparison, you're not talking about multiple actual samples, right? Or are you talking about something like ChIP differential peak analysis, which uses multiple samples? Anyway, I'm not quite clear on the question, but I think you could do this pretty easily with a new system I'm working on called looper. For simple comparisons, you would define a TSV with one line per comparison, and then write a pipeline that runs on the TSV. We use a merge table you define (another TSV) to allow you to define any number of inputs in either category.
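The two-table idea described above can be sketched in plain Python as follows. This is not looper's actual file format or API; the column names (`name`, `treatment`, `control`, `sample`, `file`) and the shape of the resulting job records are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical comparison table: one line per comparison.
COMPARISONS_TSV = "name\ttreatment\tcontrol\nA_vs_input\tchipA\tinput1\n"

# Hypothetical merge table: any number of input files per sample.
MERGE_TSV = (
    "sample\tfile\n"
    "chipA\tchipA_lane1.fastq.gz\n"
    "chipA\tchipA_lane2.fastq.gz\n"
    "input1\tinput1_lane1.fastq.gz\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Group the merge table by sample name.
files_by_sample = defaultdict(list)
for row in read_tsv(MERGE_TSV):
    files_by_sample[row["sample"]].append(row["file"])

# Expand each comparison into the full list of files on both sides.
jobs = [
    {
        "name": row["name"],
        "treatment_files": files_by_sample[row["treatment"]],
        "control_files": files_by_sample[row["control"]],
    }
    for row in read_tsv(COMPARISONS_TSV)
]
print(jobs)
```

Adding a lane or a comparison then means adding a row to one of the tables, with no change to the pipeline code.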

It's still early days and I've been trying to think of ideas for how to do something like this better. So since you've been thinking about it, if this doesn't solve your issues, I'm all ears.

written 14 months ago by nathan
Jeremy Leipzig (Philadelphia, PA) wrote, 15 months ago:

I agree with the concerns of the OP, and I think Snakemake and other frameworks should have some more built-in tools to deal with metadata (other than rolling your own lookups).

Snakemake and other implicit frameworks will also suffer when all the filenames are UUIDs (as we are seeing in TCGA/GDC although they have been kind enough to preserve suffixes so far), so a robust system for filename transformation would also be nice.
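A filename transformation of the kind wished for here could look roughly like the sketch below, which maps a UUID-named file back to a sample name while preserving the suffix. The manifest contents and the `readable_name` helper are hypothetical and do not reflect any TCGA/GDC tooling.

```python
from pathlib import PurePosixPath

# Hypothetical manifest mapping UUIDs to sample names.
MANIFEST = {"0a1b2c3d-1111-2222-3333-444455556666": "sampleA"}


def readable_name(path, manifest):
    """Replace the UUID stem of a path with its sample name, keeping suffixes."""
    p = PurePosixPath(path)
    # Split at the first dot so multi-part suffixes like .fastq.gz survive.
    uuid, dot, suffix = p.name.partition(".")
    sample = manifest[uuid]  # raises KeyError for files not in the manifest
    return str(p.with_name(sample + dot + suffix))


print(readable_name("data/0a1b2c3d-1111-2222-3333-444455556666.fastq.gz", MANIFEST))
```

With such a mapping in place, rules that infer samples from filenames can work on the readable names, for example through symlinks.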

I imagine one of the goals of SUSHI was to formalize this kind of thing by encouraging people to do this from the start. I hope there might be some way some of those principles can boil down to frameworks that aren't full-blown Rails apps.

In addition to CWL, it's worth looking at WINGS.

modified 8 weeks ago • written 15 months ago by Jeremy Leipzig
igor (United States) wrote, 15 months ago:

Look into Common Workflow Language:

The Common Workflow Language (CWL) is an informal, multi-vendor working group consisting of various organizations and individuals that have an interest in portability of data analysis workflows. Our goal is to create specifications that enable data scientists to describe analysis tools and workflows that are powerful, easy to use, portable, and support reproducibility.

All these pipelines are using that:

modified 15 months ago • written 15 months ago by igor

Yes, CWL has nice support for ontologies to define input/output formats:

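cwlVersion: v1.0
class: CommandLineTool

hints:
  - class: SoftwareRequirement
    packages:
      - package: biopython  # package name assumed from the Bio import below
        specs: [ "" ]
        version: [ "1.65", "1.66", "1.69" ]

inputs:
  fastq:
    type: File
    format: edam:format_1930  # FASTQ

baseCommand: [ python ]

arguments:
  - valueFrom: |
      from Bio import SeqIO; SeqIO.convert("$(inputs.fastq.path)", "fastq", "$(inputs.fastq.basename).fasta", "fasta");
    prefix: -c

outputs:
  fasta:  # output name assumed; it was lost from the original post
    type: File
    outputBinding: { glob: $(inputs.fastq.basename).fasta }
    format: edam:format_1929  # FASTA

$namespaces:  # assumed; needed to resolve the edam: and s: prefixes
  edam: http://edamontology.org/
  s: http://schema.org/

s:license: ""
s:copyrightHolder: "EMBL - European Bioinformatics Institute"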
modified 8 weeks ago • written 8 weeks ago by Jeremy Leipzig