Revision as of 00:09, 23 February 2007 by Gamperl

When you start an application, the ALPS scheduler organises your simulation. It checkpoints the simulation regularly, so that you can resume it later if it terminates prematurely.

For some typical sessions please check the Tutorials.

Starting a simulation

After creating a job-file with parameter2xml, you can start the simulation by typing the name of the application followed by the job-file. Here is an example:

./sse -T 3600

which will run the code sse for 3600 seconds on the input file. You can learn more about the additional command line options by going through this tutorial.

After the first checkpoint (by default after 30 minutes of runtime), job.task*.out.xml and job.task*.out.run1 files are generated. The job.task*.out.xml files contain the intermediate results in XML format while the simulation is still running, and the final results once it has finished. The job.task*.out.run1 files contain the simulation status in binary format.
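To check whether a first checkpoint has already been written, you can count the files matching the names described above. The following is only a small sketch: count_files is a hypothetical helper, not part of ALPS, and "job" is assumed to be the prefix of your job-file.

```shell
#!/bin/sh
# count_files PATH...: print how many of the given paths exist.
# The shell expands the glob before the call; a glob that matches nothing
# is passed through literally and is filtered out by the -e test.
count_files() {
    n=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# "job" is an assumed job-file prefix; adjust it to your own job name.
echo "result files:     $(count_files job.task*.out.xml)"
echo "checkpoint files: $(count_files job.task*.out.run1)"
```

If both counts are zero, the simulation has most likely not reached its first checkpoint yet.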

Re-running a simulation

In some situations (for example, if you have to shut down your computer), your job may stop before it has finished. The current status of the simulation, however, is stored in the job.task*.out.run1 files, so ALPS can restart the simulation and continue from where it stopped. To restart, pass the job.out.xml file to the application instead of the original job-file. For example:

./sse job.out.xml
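If you start jobs from a script, the choice between a fresh start and a restart can be automated. This is a minimal sketch, not an ALPS feature: pick_jobfile is a hypothetical helper, and job.in.xml is an assumed name for the job-file written by parameter2xml. The script only prints the command it would run.

```shell
#!/bin/sh
# pick_jobfile INPUT CHECKPOINT: print CHECKPOINT if it exists, else INPUT.
pick_jobfile() {
    if [ -e "$2" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}

# job.in.xml is an assumed input name; job.out.xml is the restart file
# named in the text above.
jobfile=$(pick_jobfile job.in.xml job.out.xml)
echo "would run: ./sse $jobfile"
```

On the first run no job.out.xml exists, so the original job-file is chosen; on later runs the script picks up the checkpointed state.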

Extending a simulation

Even if your simulation has finished, it may turn out that the number of sweeps you chose was not sufficient for a good result. In such cases you can edit the job.task*.out.xml files: increase the number of sweeps, and set the simulation status from finished back to running.

After that, just proceed in the same manner as for re-running a simulation.

Running a simulation on a high performance computer

How you run a simulation on a cluster depends strongly on the batch submission system in use. Typically you write a small submission script that contains the same commands you would use on a workstation. Below are some example batch files for different systems:


#!/bin/sh
## Job name
#PBS -N anssejob
## Declare job non-rerunable
#PBS -r n
## Mail to user
#PBS -m ae
## Queue name
#PBS -q thequeue
## Wall clock time required. We set it to 10 hours
#PBS -l walltime=10:00:00
## Number of nodes. We use 8 nodes, with 2 cpus per node
#PBS -l nodes=8:ppn=2
## Output some information on allocated cpus/nodes
echo $PBS_JOBID : `wc -l < $PBS_NODEFILE` CPUs allocated: `cat $PBS_NODEFILE`
cd $PBS_O_WORKDIR
NPROCS=`wc -l < $PBS_NODEFILE`
## Execute job using mpi
## This job will run for 9:30 hours (34200 seconds), and dump checkpoints every 2 hours (7200 seconds).
mpiexec -boot -machinefile $PBS_NODEFILE sse_mpi -T 34200 --checkpoint-time 7200 > ${PBS_JOBNAME}.`echo ${PBS_JOBID} | sed "s/.output//" `


#!/bin/csh

mpirun -srun sse_mpi -T 34200 --checkpoint-time 7200
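The value passed to -T in the PBS example is the requested wall clock time minus a safety margin, so that the simulation can write its final checkpoint before the batch system stops the job. The arithmetic can be sketched as follows; hms_to_seconds is a hypothetical helper, and the 30-minute margin is inferred from the example (10 hours requested, 9:30 hours of runtime), not an ALPS requirement.

```shell
#!/bin/sh
# hms_to_seconds HH:MM:SS: print the duration in seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

walltime=$(hms_to_seconds 10:00:00)   # matches "#PBS -l walltime=10:00:00"
margin=1800                           # assumed safety margin: 30 minutes
runtime=$((walltime - margin))        # 36000 - 1800 = 34200

echo "time limit option: -T $runtime"
```

With a checkpoint interval of 7200 seconds, such a run writes four regular checkpoints before the time limit is reached.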