
When you start an application, the ALPS scheduler organises your simulation. It checkpoints the simulation regularly, so that you can restart it later if it terminates prematurely.

For some typical sessions please check the Tutorials.

Starting a simulation

After creating a job file with parameter2xml, you can start the simulation by typing the name of the application followed by the job file. Here is an example:

./dirloop_sse -T 3600 job.in.xml

which will run the code dirloop_sse for 3600 seconds on the input file job.in.xml. You can learn more about a number of additional command line options in this tutorial.

After the first checkpoint (by default after 30 minutes of runtime), job.task*.out.xml and job.task*.out.run1 files will be generated. The job.task*.out.xml files contain the intermediate results in XML format (or, once the simulation has finished, the final results). The job.task*.out.run1 files contain the simulation status in binary format.

Re-running a simulation

In some situations (for example, if you have to shut down your computer), your job may stop before it has finished. The current status of the simulation, however, is stored in the job.task*.out.run1 files. ALPS can restart the simulation and continue where it stopped. To restart, pass the job.out.xml file instead of the input file. Example:

./dirloop_sse job.out.xml

Extending a simulation

Even after your simulation has finished, it may turn out that the number of sweeps you chose was not sufficient for a good result. In such cases you can edit the job.task*.out.xml files: increase the number of sweeps in the parameters section of each file. Then edit the file job.out.xml and set the simulation status from finished back to running.

After that, just proceed in the same manner as for re-running a simulation.
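A minimal sketch of these two edits with sed, assuming the ALPS XML layout (a <PARAMETER name="SWEEPS"> entry in each task file and a status="finished" attribute in job.out.xml); the new sweep count of 500000 is illustrative:

```shell
# Raise SWEEPS in every task file; the pattern assumes the
# <PARAMETER name="SWEEPS">...</PARAMETER> form used by ALPS.
for f in job.task*.out.xml; do
  [ -e "$f" ] || continue   # nothing to do if no task files are present
  sed -i 's|\(<PARAMETER name="SWEEPS">\)[0-9]*\(</PARAMETER>\)|\1500000\2|' "$f"
done

# Mark the tasks as not yet finished so the scheduler picks them up again.
if [ -e job.out.xml ]; then
  sed -i 's|status="finished"|status="running"|' job.out.xml
fi
```

Always check the resulting files before restarting; the exact XML layout may differ between ALPS versions.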

Auto-stopping a simulation

(Currently only implemented in the classical Monte Carlo application)

If you would like a task to stop before it has run through its entire set of sweeps, set the parameters ERROR_VARIABLE and ERROR_LIMIT in your parameter file (for all tasks or for each individually). ERROR_VARIABLE is the name of a particular observable (such as 'Energy') and ERROR_LIMIT is the desired absolute error you wish to achieve. The task will then halt and be recorded as finished when the specified error is reached or when the number of sweeps has run out, whichever comes first.
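For example, a parameter file (before conversion with parameter2xml) might contain entries like the following; the model, temperature, observable name, and error value are illustrative:

```
MODEL="Ising"
T=2.27
ERROR_VARIABLE="Energy"
ERROR_LIMIT=0.001
{ L=16; SWEEPS=1000000; }
{ L=32; SWEEPS=1000000; }
```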

Running a simulation on a high-performance computer

How you run a simulation on a cluster depends strongly on the batch submission system used. Typically you write a small submission script and put into it the same commands as on a workstation. Below are some example batch files for different systems.

Example script for a PBS batch system

### Job name
#PBS -N anssejob

### Declare job non-rerunable
#PBS -r n

### Mail to user
#PBS -m ae

### Queue name
#PBS -q thequeue

### Wall clock time required. We set it to 10 hours
#PBS -l walltime=10:00:00

### Number of nodes. We use 8 nodes, with 2 cpus per node
#PBS -l nodes=8:ppn=2

### Output some information on allocated cpus/nodes
echo $PBS_JOBID : `wc -l < $PBS_NODEFILE` CPUs allocated: `cat $PBS_NODEFILE`

### Execute job using mpi 
### This job will run for 9:30 hours (34200 seconds), and dump checkpoints every 2 hours (7200 seconds).
mpiexec -boot -machinefile $PBS_NODEFILE dirloop_sse_mpi -T 34200 --checkpoint-time 7200 job.in.xml > ${PBS_JOBNAME}.`echo ${PBS_JOBID} | sed "s/\..*//"`.output


On clusters where mpirun is integrated with the SLURM resource manager, the same job can be started with:

mpirun -srun dirloop_sse_mpi -T 34200 --checkpoint-time 7200 job.in.xml

Example script for the Hreidar cluster

Below is an example script "submitscript" to run a parallel application on the Hreidar cluster. Make sure the script is executable (e.g. chmod 775 submitscript). It can then be submitted to the job queue with, for example:

bsub -n8 -W90 ./submitscript

In this example the program will run on 8 nodes with a time limit of 90 minutes.

#! /bin/sh
#  example submitscript
export HOST_LIST=host_list.$$
echo $LSB_HOSTS | tr ' ' '\n' > $HOST_LIST
nprocs=`wc $HOST_LIST|awk '{print $1}'`
sort $HOST_LIST | uniq > ${HOST_LIST}_lam
lamboot ${HOST_LIST}_lam 
mpirun -np $nprocs dirloop_sse_mpi -T 5000 --checkpoint-time 1000 job.in.xml
wipe ${HOST_LIST}_lam