What Is Your Method For Managing Data Provenance?
Asked 13.7 years ago

Any moderately complex analysis uses data from multiple sources, some of which may be more trustworthy than others. A data source may be known to have problems, but be useful nevertheless, or its reliability may be unknown. Any results that touch bad data may become tainted, so how do you manage provenance in your own analyses? How do you communicate this to consumers of your data or code?

The accepted answer will need to be specific. Here are some hypothetical answers:

  • All data sources and versions are recorded in a README file stored with the analysis scripts and results.

  • Literate programming is used with data sources and versions recorded in the document.

  • New versions of data sources are regression tested with a set of standard analyses prior to use, to ensure that they give the expected results (a sketch of such a check appears after this list).

  • Within the analysis pipelines, all data are annotated with a reliability score. If a function is called on data values with differing scores, the return value is assigned the mean/median/minimum of the input reliability scores (a sketch of this appears at the end of the question).

  • Stock trading algorithms are used to annotate result reliability in realtime, based on customer feedback (!)
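
To make the regression-testing idea concrete, here is a minimal sketch of what such a check might look like in Python; the "standard analysis", file paths, and tolerance are invented for illustration, and a real pipeline would substitute its own:

    # Hypothetical regression check: run a fixed "standard analysis" on a new
    # release of a data source and compare it against the trusted release.
    import csv

    def gene_count(annotation_path):
        # Toy standard analysis: count distinct IDs in column 1 of a TSV file.
        with open(annotation_path, newline="") as fh:
            return len({row[0] for row in csv.reader(fh, delimiter="\t") if row})

    def test_new_release_matches_baseline():
        baseline = gene_count("data/annotation-v1.tsv")   # frozen, trusted release
        candidate = gene_count("data/annotation-v2.tsv")  # release under evaluation
        # Tolerate a small amount of drift; anything larger needs a human to look at it.
        assert abs(candidate - baseline) / baseline < 0.01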

I'm interested because I'm writing analysis pipelines and would like to introduce automatic, fine-grained provenance tracking. Using Clojure metadata is one possible route to achieve this.
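
For illustration, this is roughly the kind of fine-grained propagation I have in mind, sketched in Python as a stand-in for the Clojure-metadata route; the Scored wrapper, the min() combining rule, and the source labels are assumptions for the example, not an existing library:

    # Hypothetical sketch: wrap each value in a record carrying a reliability
    # score and the set of sources it was derived from.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Scored:
        value: float
        score: float                       # reliability in [0, 1]
        sources: frozenset = frozenset()   # provenance trail

    def combine(fn, *args):
        # Apply fn to the raw values; the result inherits the minimum (most
        # pessimistic) score and the union of all input sources.
        return Scored(
            value=fn(*(a.value for a in args)),
            score=min(a.score for a in args),
            sources=frozenset().union(*(a.sources for a in args)),
        )

    # Mixing a trusted value with a shakier one taints the result:
    a = Scored(10.0, score=0.9, sources=frozenset({"ensembl-75"}))
    b = Scored(3.0, score=0.4, sources=frozenset({"in-house-draft"}))
    c = combine(lambda x, y: x / y, a, b)
    print(c.score, sorted(c.sources))      # 0.4 ['ensembl-75', 'in-house-draft']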

pipeline workflow annotation

Answer (6 votes, 13.7 years ago)

The approaches I usually use are:

  1. Freeze every input database or dataset to record exactly what was used as input (one way to do this is sketched after this list).
  2. Perform standard benchmarks of new data sources to have a handle on their quality.
  3. Use reliability scores derived from benchmarking throughout the analysis pipelines.
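
As a minimal sketch of step 1, assuming the inputs live in a flat directory, a checksum manifest written next to the analysis is one way to freeze exactly what was used:

    # Sketch: record the name, size, and SHA-256 of every input file so a later
    # run can verify it is working from exactly the same data.
    import hashlib, json, sys
    from pathlib import Path

    def sha256(path, chunk=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as fh:
            for block in iter(lambda: fh.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def write_manifest(input_dir, out="inputs.manifest.json"):
        entries = [
            {"file": p.name, "bytes": p.stat().st_size, "sha256": sha256(p)}
            for p in sorted(Path(input_dir).iterdir()) if p.is_file()
        ]
        Path(out).write_text(json.dumps(entries, indent=2))

    if __name__ == "__main__":
        write_manifest(sys.argv[1])
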
Comment:

Not many steps to remember, easy to achieve and effective. Nice answer.

Answer (3 votes, 13.7 years ago) by Nathan Harmston ★ 1.1k

Hi,

So I'm a big fan of relational databases (for better or for worse) for this kind of thing. I typically store as much as I can of my input, most of the intermediate output, and all of the final output in a database (SQLite or MySQL, depending on the size). For the input data tables I keep an associated source record containing the version number of the file, possibly the README as a text field, the URL I got the data from, the date I retrieved the file, and so on.

Then I create an audit trail for all of the intermediate and final output and store it in the database as well, recording things like name, version number, and date, so I can trace back the steps that led to a given set of outputs.

I use SQLAlchemy a lot (with Python I get SQLAlchemy, Rpy and ctypes :D, which is great for combining pipelines), so I just alter the mappers for my ORM to retrieve specific data from specific versions, which ties each output back to a specific pipeline run.
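
As a rough illustration of the kind of schema and version-pinned retrieval described above (the table, column, and model names are invented for the example, not my exact setup), something along these lines is possible with SQLAlchemy's declarative mapping:

    # Illustrative schema: one row per downloaded input file, plus an audit row
    # for every derived output, so results can be traced back to their sources.
    from datetime import datetime
    from sqlalchemy import (Column, DateTime, ForeignKey, Integer, String, Text,
                            create_engine, select)
    from sqlalchemy.orm import Session, declarative_base, relationship

    Base = declarative_base()

    class Source(Base):
        __tablename__ = "source"
        id = Column(Integer, primary_key=True)
        name = Column(String, nullable=False)       # e.g. "gene annotation"
        version = Column(String, nullable=False)    # e.g. "release-75"
        url = Column(String)
        readme = Column(Text)                       # original README, verbatim
        retrieved = Column(DateTime, default=datetime.utcnow)

    class Output(Base):
        __tablename__ = "output"
        id = Column(Integer, primary_key=True)
        name = Column(String, nullable=False)
        pipeline_version = Column(String)
        created = Column(DateTime, default=datetime.utcnow)
        source_id = Column(Integer, ForeignKey("source.id"))
        source = relationship("Source")

    engine = create_engine("sqlite:///provenance.db")
    Base.metadata.create_all(engine)

    # Pinning a query to one input version is then just a filter:
    with Session(engine) as session:
        stmt = (select(Output)
                .join(Output.source)
                .where(Source.name == "gene annotation",
                       Source.version == "release-75"))
        for out in session.execute(stmt).scalars():
            print(out.name, out.pipeline_version, out.created)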

It took me a while to get this set up, and there are still things I could do to make it a lot better, but I really like this approach. I've used it for comparative genomics, managing microarray metadata, and some text-mining work.

In the end, however, the consumer is ... me.

I'm not sure how relevant this is to your plans, though.

HTH

Comment:

Thanks for your detailed answer. Even if you are the sole consumer, it's reassuring to know you'll be able to retrace everything a year or more later.

