Porting problems => MC sims -> bins to file, then calculate averages with external tools (MC / data analysis separated): stores bins and bin means! A lot of data; many bins are needed for the covariance matrix.
Format: for each bin: average sign (some number), observable sign. each file has a name corresponding to the observable. file does not contain meta information. further bins are simply appended.
tools: Jackknife, Bootstrap, Maxent;
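A minimal sketch of what such an external jackknife tool might look like, in Python. Assumptions (not in the notes): each line of the bin file holds two whitespace-separated numbers, the bin mean of the sign and the bin mean of observable times sign, and the quantity of interest is their ratio. The function names read_bins and jackknife_ratio are made up for illustration.

```python
import math

def read_bins(path):
    """Read one bin per line: '<sign> <observable*sign>' (assumed layout)."""
    bins = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 2:
                bins.append((float(fields[0]), float(fields[1])))
    return bins

def jackknife_ratio(bins):
    """Jackknife estimate and error of <O*sign> / <sign> from per-bin means."""
    n = len(bins)
    if n < 2:
        raise ValueError("need at least two bins")
    total_sign = sum(s for s, _ in bins)
    total_obs = sum(o for _, o in bins)
    full = total_obs / total_sign
    # Leave-one-out estimates: drop bin i and recompute the ratio.
    loo = [(total_obs - o) / (total_sign - s) for s, o in bins]
    loo_mean = sum(loo) / n
    variance = (n - 1) / n * sum((x - loo_mean) ** 2 for x in loo)
    # Bias-corrected jackknife estimate.
    estimate = n * full - (n - 1) * loo_mean
    return estimate, math.sqrt(variance)
```

Since further bins are simply appended to the file, re-running the analysis on a growing file needs no change: read_bins just returns a longer list.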
throws stuff into multiple plaintext files like raw-lr-<system_size>_seedXXXX.dat; reason: a lot of data -> too large for XML, about 1000 files per sim
basically a header (metadata, hash) and the data body
stored are: usually time series, by storing snapshots at logarithmic distances;
uses exchange MC -> store many temperatures, 3-10 observables (3-10 columns); then compressed (bz2)
blocks of lines/rows (for different sim times; separated by empty lines)
tools: use perl for parsing;
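The layout described above (header, then blocks of rows separated by empty lines, bz2-compressed) could be parsed roughly as follows. This is a sketch in Python rather than the Perl actually used, and it assumes that header lines start with '#'; the notes do not say how the header is marked. parse_blocks and read_compressed are hypothetical names.

```python
import bz2

def parse_blocks(text):
    """Split text into a header and a list of blocks.

    Header lines are assumed to start with '#'. Each block is a list of
    rows, each row a list of floats (one column per observable).
    """
    header, blocks, current = [], [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            header.append(stripped.lstrip("#").strip())
        elif not stripped:
            # An empty line closes the current block.
            if current:
                blocks.append(current)
                current = []
        else:
            current.append([float(x) for x in stripped.split()])
    if current:
        blocks.append(current)
    return header, blocks

def read_compressed(path):
    """Read a bz2-compressed plaintext file and parse it."""
    with bz2.open(path, "rt") as fh:
        return parse_blocks(fh.read())
```

Each returned block corresponds to one simulation time, so a wish like "do jackknife for these files" reduces to looping over files and blocks.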
wants to have: easier parsing ("do jackknife for these files"); better readability through tabbed XML (no long tags), i.e. XML in a table structure. DB integration for storage; also for publications
no binary format because: 1. architecture problems, 2. human readable
Basically XML for input specification, output in XML/hdf5;
interested in standard format for interoperability between QMC and QC/DFT packages.
wishes: modular, hierarchical structure; extensible, interoperability
current implementation: hybrid XML plus hdf5; XML stores summaries of sims; hdf5 contains "everything" (like ALPS?). reason for using a binary format is its efficiency.
why hdf5? highly organized and hierarchical storage format; it has multidimensional arrays ("data sets") and groups (directory-like structures); key point was hdf5 efficiency (I/O) for high I/O applications (in particular for parallel IO); collection of analysis tools for data manipulation and visualization; it is an open standard and platform independent.
initial problem: no fortran library (but new implementations exist)
benchmarking was performed => it's fast and efficient
stored are for example: scalar data and tabular data mainly; average data for tabular data. occasionally time series (therefore efficiency was critical)
cool thing about hdf5: has conversion tools (dump multiple hdf5 data files into a merged one; filters; convert to XML etc.); the library already does a lot of the work.
storage backend: filesystem/directories/files
hdf5 has a stable API, proven, known-to-work technology.
motivation for XML:
1. interoperability; the code 'grows', evolves; multiple codes/implementations exist for similar problems, so the main motivation for XML is to have interoperability between different codes.
major problem: fortran plus XML = no good; but they're working on a fortran XML lib.
usage of XSLT with XML schemas/grammars (XSD). Community wide standards are not as important as uniform specifications within particular codes (no Uber-Format needed).
2. visualization is important, for example using VisIt (which supports XML as I understand)
3. very large scale runs, there is a need to make the data easily available to the public (the 'tax payers' want to know what's going on)
Details on the type of simulations that are performed: use DFT to compute energy, Wang-Landau sampling for DOS => free energy. not too much IPC between different nodes is required, so it should scale. Some of the codes are here to stay for a long time, so a good data format is important.
Does DCA-Hubbard calculations (QMC for Hubbard-like systems). Primary interest in Green's functions.
Implementations: Storage of block data in text file, currently untagged formats, so changes are difficult later, especially if multiple people are working on the same project. About 1GB of data, but growing.
Problem: Not only branch codes, but also branch formats...
Workflow: QMC with small text files as input => post-processing analysis. Unfortunately, this usually requires good knowledge of the simulation setup/internals. Many small ad-hoc utilities are used.
Wishes/plans: 1. easier workflow 2. single/small set of analysis tools 3. reduce errors as simulations are expensive. 4. store more data, weaker coupling of simulation and analysis tools. 5. interoperability for collaboration and publication of datasets.
Current implementation: Write everything in a line (model parameters, meta information such as a serial number to specify the project), then the bins containing parameters and data averages.
Problems: Data files become too large after some time (disk quota, slow analysis/parsing); sometimes the system maximum line length is too short to store all required data.
Problem: Three different collaborations. One of them (Lode's one-man show): plain text file (like Helmut); it works for personal use or if you are the only developer. Collaborating project: special way of calculating error bars: make measurements; write down only the running means; throw away the first quarter of the measurements; then find the maximum error/variance of the remaining part. Provides a nice way to see what happens. Data format: dump everything into a plain text file.
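The running-mean error bar described above might be sketched like this in Python. The notes leave "maximum error/variance" open; the sketch takes one possible reading, namely the largest deviation of the remaining running means from the final running mean. That reading, and the names running_means and running_mean_error, are assumptions, not the collaboration's actual procedure.

```python
def running_means(measurements):
    """Return the running mean after each measurement."""
    means, total = [], 0.0
    for i, x in enumerate(measurements, start=1):
        total += x
        means.append(total / i)
    return means

def running_mean_error(means, discard_fraction=0.25):
    """Discard the first quarter of the running means, then return the final
    running mean and the largest deviation from it among the remaining ones
    (assumed interpretation of 'maximum error')."""
    start = int(len(means) * discard_fraction)
    kept = means[start:]
    if not kept:
        raise ValueError("no running means left after discarding")
    final = kept[-1]
    return final, max(abs(m - final) for m in kept)
```

Only the running means need to be stored for this, which matches the "write down only the running means" step.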
COST (what is this exactly?)
Currently: A lot of hand-made analysis/rules; collaboration within many different groups (some parts of the workflow are performed by different groups).
For large scale application: intelligent format for analysis is needed; Issues: Archiving/extraction. The problem is: They will develop their ideas by doing sims, so it is not always entirely clear in advance what kind of measurement/data is needed exactly. A good way to extract the data is therefore important.
XML/CML with XSLT for extraction.
QC application -> output -> Parser -> XML (in DB (XINDICE), but it's slow) -> XSLT -> analysis (also into DB).
This process has been tested successfully, high level automation, error recovery. XML DB storage was a success.
Open tasks/questions: XML schemas; more robust workflow (currently scripts driven).
How it is done in ALPS:
Store: Meta data: parameters for simulation, simulation history (when and on which machine the sim was running). Future ideas: Also store hardware and software (compiler etc.) information to track problems. Maybe also store user information, simulation status ("how much do you trust the results"), publication information (whom to cite, publications related to this data).
In XML file: Storage of (partially) evaluated data, such as observable name, index for vector-valued observables; count, sum, sum of squares, mean, variance; autocorrelation times, error convergence; bin means, errors of bins. Desired extension: Dictionary link to explain observables (to specify conventions, for example, +/- J in Heisenberg Hamiltonians).
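To illustrate why count, sum and sum of squares are enough to recover mean and variance, here is a small Python sketch. The class name Accumulator and its methods are invented for this example and are not the ALPS API; the error shown ignores autocorrelations, which is why autocorrelation times and bin data are stored as well.

```python
import math

class Accumulator:
    """Keeps count, sum and sum of squares of a scalar observable."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.count

    def variance(self):
        # Unbiased sample variance from the stored sums.
        m = self.mean()
        return (self.total_sq - self.count * m * m) / (self.count - 1)

    def naive_error(self):
        # Standard error of the mean, valid only for uncorrelated samples.
        return math.sqrt(self.variance() / self.count)
```

Two such records from different runs can be merged by adding the three stored numbers, which is one reason to keep the raw sums next to the derived mean and variance.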
In binary data: time series; jackknife bins; hdf5 and XDR are implemented. Possible extensions: for a large number of observables, it would be better to store more stuff in binary format (file size problems) than is done now.
XML input format: describe tags by attribute (i.e. <parameter name="BLAH"> instead of <BLAH>...</BLAH>). XML output: same story here, so the grammar is basically generic. Link to time series binary data.
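The benefit of the attribute-based style is that one generic reader handles any parameter set. A sketch in Python: the <parameter name="..."> form is from the notes, while the enclosing <parameters> element, the sample values and the function name read_parameters are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input file; only the <parameter name="..."> form is from the notes.
SAMPLE = """<parameters>
  <parameter name="L">16</parameter>
  <parameter name="T">0.5</parameter>
</parameters>"""

def read_parameters(xml_text):
    """Return all parameters as a dict of name -> string value.

    No parameter name is hardcoded: new parameters need no change
    to the reader or to the XML grammar.
    """
    root = ET.fromstring(xml_text)
    return {p.get("name"): (p.text or "").strip()
            for p in root.iter("parameter")}
```

With <BLAH>...</BLAH> style tags, the schema and every reader would have to be extended each time a code introduces a new parameter.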
Data representation/transformation: XSLT
Tools implemented: Jackknife, error convergence tests, creation of plots from results of many simulations.
Archive data in DB.
Format requirements: Flexible: no hardcoded names; links to dictionary pages; human readable; analysis using other tools/languages (Python, Perl etc.)
Ideas: idea 1: XML for meta data; binary format: XNF, which supports 1D,2D and 3D arrays. Takes care of endianness. idea 2: XDMF: XML for meta data with a link to HDF file which contains the binary data. Advantage: Support for visualization tools.
Archiving of ALPS output: parse XML into DB (SQL-based). Provides central place of storage for later analysis. Loading into DB takes some time, but after that, data access is very fast.
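A rough sketch of the "parse XML into DB" step, using Python with SQLite as a stand-in for whatever SQL database is actually used. The XML element names (observable, mean, error), the table layout and the function name archive are all assumptions for illustration and do not reproduce the real ALPS output schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical result file, not the real ALPS schema.
SAMPLE = """<results>
  <observable name="Energy"><mean>-1.5</mean><error>0.01</error></observable>
  <observable name="Magnetization"><mean>0.25</mean><error>0.02</error></observable>
</results>"""

def archive(xml_text, conn, run_id):
    """Load all observables of one run into a 'results' table; return row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(run_id TEXT, observable TEXT, mean REAL, error REAL)"
    )
    root = ET.fromstring(xml_text)
    # Each observable is expected to carry a <mean> and an <error> child.
    rows = [
        (run_id, obs.get("name"),
         float(obs.findtext("mean")), float(obs.findtext("error")))
        for obs in root.iter("observable")
    ]
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Usage: after conn = sqlite3.connect(":memory:") and archive(SAMPLE, conn, "run-001"), a query such as SELECT mean, error FROM results WHERE observable = 'Energy' returns the stored values without re-parsing any XML, which is the "slow to load, fast to access" trade-off noted above.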