This is a summarized transcript of the group discussion.
Data format roundup
Under discussion: XML and HDF5.
earlier proposal by David Ceperley (see separate page/document)
XML disadvantage: large tags take up too much space (plain text files are comparatively small).
XML should use standardized, observable-independent tags of the form <average value="energy" mean="1200" error="52"/> rather than observable-specific tags like <energy average="..." mean="..." />
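A minimal sketch of the proposed convention, using Python's standard xml.etree.ElementTree; the element and attribute names follow the example above, and the reading code is a hypothetical illustration of why generic tags help (one reader handles every observable):

```python
import xml.etree.ElementTree as ET

# Build a measurement element in the proposed generic form:
# the observable name is an attribute, not the tag itself.
avg = ET.Element("average", value="energy", mean="1200", error="52")
xml_text = ET.tostring(avg, encoding="unicode")

# A generic reader needs no per-observable code: it matches the
# standardized tag and reads the observable name from 'value'.
elem = ET.fromstring(xml_text)
name = elem.get("value")
mean = float(elem.get("mean"))
error = float(elem.get("error"))
```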
The choice was made in favour of HDF5
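As a rough sketch of what an HDF5 layout could look like, using h5py with an in-memory file; the group path and attribute names are invented for illustration (the numbers mirror the XML example above) and are not an agreed schema:

```python
import h5py

# In-memory HDF5 file (driver='core' with no backing store),
# so this sketch leaves nothing on disk.
f = h5py.File("mc_results.h5", "w", driver="core", backing_store=False)

# Hypothetical layout: one group per observable, the time series
# as a dataset, summary statistics as attributes.
grp = f.create_group("simulation/results/energy")
grp.create_dataset("timeseries", data=[1148.0, 1252.0, 1200.0])
grp.attrs["mean"] = 1200.0
grp.attrs["error"] = 52.0

# Reading back is symmetric and self-describing.
stored_mean = float(f["simulation/results/energy"].attrs["mean"])
n_samples = f["simulation/results/energy/timeseries"].shape[0]
```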
Equilibration checks: discard at least 3 times the autocorrelation time; start with different initial conditions and check whether the means match after the runs.
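A pure-Python sketch of the two checks mentioned: an integrated autocorrelation time estimate (summing the normalized autocorrelation until it first turns negative, a simple windowing choice) and a comparison of means from independently started runs. The function names and the cutoff rule are illustrative, not a prescribed method:

```python
def autocorr_time(x):
    """Integrated autocorrelation time tau_int = 1/2 + sum_t rho(t),
    truncated at the first negative rho(t) (a simple window rule)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    if c0 == 0.0:
        return 0.5  # constant series: no fluctuations
    tau = 0.5
    for t in range(1, n // 2):
        ct = sum((x[i] - mean) * (x[i + t] - mean) for i in range(n - t)) / (n - t)
        rho = ct / c0
        if rho < 0.0:
            break
        tau += rho
    return tau

def means_compatible(a, b, n_sigma=3.0):
    """Crude agreement check for two runs: compare the difference of
    means against naive errors inflated by 2*tau_int for each run."""
    def mean_and_err(x):
        n = len(x)
        m = sum(x) / n
        var = sum((v - m) ** 2 for v in x) / (n - 1)
        return m, (2.0 * autocorr_time(x) * var / n) ** 0.5
    ma, ea = mean_and_err(a)
    mb, eb = mean_and_err(b)
    return abs(ma - mb) <= n_sigma * (ea ** 2 + eb ** 2) ** 0.5
```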
Wang-Landau is a special case (evaluation, reweighting etc.)
Single MC runs vs. multiple MC runs: multiple runs need averaging and non-linear operations on measurements (Binder cumulants etc.). This should be done in a scripting language (i.e. not the way it is done in ALPS at the moment).
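As an example of the non-linear post-processing meant here, a hypothetical Python helper for the fourth-order Binder cumulant U4 = 1 - &lt;m^4&gt; / (3 &lt;m^2&gt;^2), evaluated over samples pooled from multiple runs:

```python
def binder_cumulant(samples):
    """Fourth-order Binder cumulant U4 = 1 - <m^4> / (3 <m^2>^2).
    'samples' are magnetization measurements, e.g. pooled from
    several independent MC runs."""
    n = len(samples)
    m2 = sum(m ** 2 for m in samples) / n
    m4 = sum(m ** 4 for m in samples) / n
    return 1.0 - m4 / (3.0 * m2 ** 2)
```

Because U4 is non-linear in the averages, it must be computed from the combined data (or via jackknife/bootstrap over runs), not by averaging per-run cumulants; that is why the discussion places it in a scripting layer rather than in the simulation code.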
DB backend is a requirement. ALPS already provides this, but only for its own format (could be extended). David's group has an OpenGL visualization tool for path integral codes ready. It should probably be extended to a more general application.
Disorder averaging is also required (e.g. for spin systems). This usually gives a large number of different files which need to be analyzed.
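A sketch of the disorder-averaging workflow, assuming (hypothetically) one small JSON result file per disorder realization; the point is only the pattern of globbing many files and averaging over realizations, the file layout and the "energy" key are invented:

```python
import glob
import json
import os
import tempfile

def disorder_average(pattern):
    """Average an observable over all disorder-realization files
    matching 'pattern'; returns (mean, error of mean over realizations)."""
    values = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            values.append(json.load(fh)["energy"])  # hypothetical key
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, (var / n) ** 0.5

# Demo: write three fake realization files and average them.
tmp = tempfile.mkdtemp()
for seed, e in enumerate([-1.0, -2.0, -3.0]):
    with open(os.path.join(tmp, f"real{seed}.json"), "w") as fh:
        json.dump({"energy": e}, fh)
mean, err = disorder_average(os.path.join(tmp, "real*.json"))
```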
Plotting tools: use the tools that are already around; they are sophisticated enough. We just need the interface between the plotting tool and the backend (DB/flat files). This is also covered by Lukas' implementation (scalar averages from vectors). Such transformations should be done in higher-level languages (Python etc.)
DB backend: it is not yet known how well the current (i.e. Lukas') implementation scales. Needs to be tested.
More intelligent compression.
Which language do we use? One view: does the language really matter, since the data is in XML? Nevertheless, provide a simple interface to common tasks in C++/Fortran/Python/Java.
What happens to large chunks of data that cannot be easily moved away from the nodes (Blue Gene situation)?
Developing DB backends: NCSA has some DB experts, as does ETH, but they do not seem to be interested, unfortunately.
What should be stored in the files?
See separate document.
We need 'dictionaries' to, for example, specify what one means by "Heisenberg model". Requires an external server/Wiki.
MC data that should be stored
More detail in separate document
- Binned time series: constant and exponential bin width; also include basic evaluation.
- Is it called a bin or a block? Most probably both should be usable, but which one is the preferred nomenclature?
- What do we call runs belonging to the same simulation? Clones, replicas, scans, realizations? Maybe self-defined, once the unit of a 'run' is defined. Scans are a set of simulations with a few parameters varied systematically.
- What meta information can be standardized? Filter data from one kind of simulation into another.
- How do we store dynamical variables (when parameters change during a simulation)? Possible ways out: declare it as a separate, different simulation if you do not want to mix the data of different parameter values; or simply store the parameter as part of the time series (non-equilibrium MC).
- Organizational structure: we don't need something like that right now, but it might become relevant at some point...
- Name of the whole thing?
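The binned time series mentioned above can be sketched as follows: constant-width rebinning plus a doubling ("exponential") binning hierarchy whose per-level error estimate plateaus once the bins exceed the autocorrelation time. Function names and the minimum-bin cutoff are illustrative assumptions:

```python
def rebin(series, width):
    """Constant bin width: average consecutive, non-overlapping bins
    of 'width' entries (a trailing partial bin is dropped)."""
    n = len(series) // width
    return [sum(series[i * width:(i + 1) * width]) / width for i in range(n)]

def binning_levels(series, min_bins=4):
    """Exponential bin width: double the bin size at each level and
    record the naive error of the mean per level; for correlated data
    the error grows with level until it plateaus at the true error."""
    levels = []
    data = list(series)
    level = 0
    while len(data) >= min_bins:
        n = len(data)
        mean = sum(data) / n
        var = sum((v - mean) ** 2 for v in data) / (n - 1)
        levels.append((level, n, (var / n) ** 0.5))
        data = rebin(data, 2)
        level += 1
    return levels
```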
How do we get current simulation values while the simulation is running? This is done by the reporting tool, which writes some common data to a plain file (XML, plain text, etc.).
It was agreed that each group will file its data format proposal for further discussion.
Further points that were discussed/agreed upon:
- XML tags: use generic tags; one could probably treat <scalar> as a special case of multidimensional data types.
- We want support for complex data types. What is a complex observable's 'error bar'? Give choices in the reporting tool.
- Which domain for publishing dictionaries, schemas, format specifications etc.? stat-phys.org is a good candidate (registered by Helmut). We will start with stat-phys.org.
Data format draft working group
- Matthias Troyer, Jeongnim Kim, Brian
- none currently