Provenance Best Practices

From ALPS
Revision as of 00:09, 16 October 2013 by Troelsfr

During the ETH Provenance Challenge we identified several best practices for producing provenance-rich scientific work.

Minimal requirements:

  • use version control for sources and scripts
    • commit often
    • write descriptive but concise commit messages
  • store the revision number/repository state
  • store input parameters (incl. random seeds) used to obtain the data
  • create a directory per figure containing relevant scripts
  • store the plotted values in a text file that accompanies the figure
  • upload raw output
  • describe the post-processing procedure that turns raw data into plotted values


Additional features:

  • store build information
    • store branch, revision number, build time and node
    • any data output should carry attributes from which this information can be recovered (e.g. headers of a text file, or attributes in HDF5)
  • store runtime settings
    • store command line arguments, runtime and node
  • link figures to evaluation scripts and data
    • given only the PDF figure, can you trace back to the version of the code and the parameters used in the simulation?

Compiling code with provenance from Git repository

This example shows how to embed Git repository information, such as the branch and revision hash, in your compiled program. It is easily portable to CMake and Subversion.

Makefile:

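# Temporary file that holds the one-line build description
BUILDHEADER=/tmp/buildheader.info
# The backticks are run by the shell inside the recipe, i.e. at compile time
BUILDSTAMP="\"`head -n 1 ${BUILDHEADER}`\""
FLAGS = -O3 -DBUILD_STAMP=${BUILDSTAMP}

# Not a real file: regenerate the header on every build
.PHONY: buildheader
buildheader:
	command -v git >/dev/null 2>&1 && echo "Build date" `date +'%y.%m.%d %H:%M:%S'` "NL" "Branch: " `git rev-parse --abbrev-ref HEAD` "NL" "Hash: " `git rev-parse HEAD` > ${BUILDHEADER}

# Recipe lines must start with a tab character, not spaces
program: buildheader
	c++ ${FLAGS} -o program program.cpp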

program.cpp:

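#include <iostream>

// BUILD_STAMP is normally supplied by the Makefile via -DBUILD_STAMP="..."
#ifndef BUILD_STAMP
#define BUILD_STAMP "unknown build"
#endif
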
int main() {
    std::cout << "Save the macro BUILD_STAMP with your data." << std::endl;
    std::cout << BUILD_STAMP << std::endl;
    return 0;
}