Documentation:Running

When you start an application, the ALPS scheduler manages your simulation. It checkpoints the simulation at regular intervals, so that you can resume it later if it terminates prematurely.

For some typical sessions please check the Tutorials.

Starting a simulation

After creating a job-file with parameter2xml you can start the simulation by typing the name of the application followed by the job-file. Here is an example:

./dirloop_sse -T 3600 job.in.xml

which will run the code dirloop_sse for 3600 seconds on the input file job.in.xml. You can learn more about a number of additional command line options by going through this tutorial.

After the first checkpoint (by default after 30 minutes of runtime), the files job.task*.out.xml and job.task*.out.run1 are generated. The job.task*.out.xml files contain the results in XML format: intermediate results while the simulation is still running, and the final results once it has finished. The job.task*.out.run1 files contain the simulation status in binary format.

Re-running a simulation

In some situations (for example, if you have to shut down your computer), your job may stop before it has finished. The status of the simulation at the last checkpoint, however, is stored in the job.task*.out.run1 files, so ALPS can restart the simulation and continue from where it stopped. To restart, pass the file job.out.xml to the application instead of job.in.xml. For example:

./dirloop_sse job.out.xml
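
The choice between the two job files can be automated in a wrapper script. The following is a minimal sketch, not part of ALPS: the function name select_job_file is hypothetical, and it assumes that job.out.xml exists whenever a restart is possible.

```shell
#!/bin/sh
# Hypothetical helper (not part of ALPS): print the job file to pass to an
# ALPS application. If <base>.out.xml exists, assume an earlier run left a
# checkpoint and restart from it; otherwise start fresh from <base>.in.xml.
select_job_file() {
    base=$1
    if [ -f "${base}.out.xml" ]; then
        echo "${base}.out.xml"   # restart from the last checkpoint
    else
        echo "${base}.in.xml"    # first run
    fi
}

# Usage (shown as a comment so that sourcing this file has no side effects):
#   ./dirloop_sse -T 3600 "$(select_job_file job)"
```

With such a helper the same command line can be used for the first run and for every restart.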

Extending a simulation

Even if your simulation has finished, it may turn out that the number of sweeps you chose was not sufficient for a good result. In such cases you can edit the job.task*.out.xml files and increase the number of sweeps in the parameters section of each file. Then edit the file job.out.xml and set the simulation status from finished back to running.

After that, proceed in the same way as for re-running a simulation.
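
The status change in job.out.xml can also be scripted. The sketch below rests on an assumption that should be checked against your own file first: that the status is written as the attribute status="finished". The function name reopen_job is hypothetical. The number of sweeps in the job.task*.out.xml files still has to be increased separately.

```shell
#!/bin/sh
# Hypothetical helper (not part of ALPS): set every entry marked as finished
# in a job file back to running, keeping a backup of the original file.
# ASSUMPTION: the status is stored literally as status="finished";
# inspect your job.out.xml before relying on this.
reopen_job() {
    jobfile=$1
    cp "$jobfile" "${jobfile}.bak" || return 1
    # Read from the backup and rewrite the job file; this avoids "sed -i",
    # which is not portable across sed implementations.
    sed 's/status="finished"/status="running"/g' "${jobfile}.bak" > "$jobfile"
}

# Usage (shown as a comment so that sourcing this file has no side effects):
#   reopen_job job.out.xml
```

If the result is not what you expected, the original file can be restored from job.out.xml.bak.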

Auto-stopping a simulation

(Currently only implemented in the classical Monte Carlo application)

If you would like to have a task stop before it has run through its entire set of sweeps, set the parameters ERROR_VARIABLE and ERROR_LIMIT in your parameter file (for all tasks or each individually) where ERROR_VARIABLE is the name of a particular observable (such as 'Energy') and ERROR_LIMIT is the desired absolute error you wish to achieve. Upon doing so, the task will halt and be recorded as finished when the specified error is achieved or when the number of sweeps has run out, whichever comes first.
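
As an illustration, the script below writes a small parameter file that sets both parameters globally and overrides ERROR_LIMIT for one task. It is only a sketch: the file name parm_autostop and all numerical values are placeholders, the other parameters your model needs are omitted, and the layout (global assignments followed by one brace-delimited block per task) is assumed to match the parameter files you already use.

```shell
#!/bin/sh
# Write a hypothetical parameter file that enables auto-stopping.
# File name and all values are placeholders chosen for illustration only.
PARAM_FILE=${PARAM_FILE:-parm_autostop}

cat > "$PARAM_FILE" <<'EOF'
SWEEPS=100000;
ERROR_VARIABLE="Energy";
ERROR_LIMIT=0.001;
{ T=1.0; }
{ T=2.0; ERROR_LIMIT=0.0005; }
EOF
```

Here the first task would stop once the absolute error of Energy reaches 0.001, the second once it reaches 0.0005, and either task stops after 100000 sweeps if its limit has not been reached by then. The file is then converted to a job file with parameter2xml as usual.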

Running a simulation on a high performance computer

How you run a simulation on a cluster depends strongly on the batch submission system in use. Typically you write a small submission script that contains the same commands you would use on a workstation. Below are some example batch files for different systems:

PBS and LAM/MPI:

#!/bin/sh

### Job name
#PBS -N anssejob

### Declare job non-rerunnable
#PBS -r n

### Mail to user
#PBS -m ae

### Queue name
#PBS -q thequeue

### Wall clock time required. We set it to 10 hours
#PBS -l walltime=10:00:00

### Number of nodes. We use 8 nodes, with 2 cpus per node
#PBS -l nodes=8:ppn=2

### Output some information on allocated cpus/nodes
echo $PBS_JOBID : `wc -l < $PBS_NODEFILE` CPUs allocated: `cat $PBS_NODEFILE`
cd $PBS_O_WORKDIR
NPROCS=`wc -l < $PBS_NODEFILE`

### Execute job using mpi 
### This job will run for 9:30 hours (34200 seconds), and dump checkpoints every 2 hours (7200 seconds).
mpiexec -boot -machinefile $PBS_NODEFILE dirloop_sse_mpi -T 34200 --checkpoint-time 7200 ssejobfile.in.xml > ${PBS_JOBNAME}.`echo ${PBS_JOBID} | sed "s/.output//"`
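
The script above requests 10 hours of wall clock time but passes -T 34200 (9:30 hours), which leaves half an hour before the batch system ends the job. This arithmetic can be done in the script instead of by hand. The following is a minimal sketch; the function name walltime_to_limit and the margin argument are hypothetical and not part of ALPS or PBS.

```shell
#!/bin/sh
# Hypothetical helper: convert a walltime given as HH:MM:SS into seconds and
# subtract a safety margin (in seconds), giving a value suitable for -T.
# awk is used for the arithmetic so that fields with leading zeros such as
# "08" are read as decimal numbers.
walltime_to_limit() {
    echo "$1" | awk -F: -v margin="$2" '{ print $1 * 3600 + $2 * 60 + $3 - margin }'
}

# Usage (shown as a comment so that sourcing this file has no side effects):
#   TLIMIT=$(walltime_to_limit 10:00:00 1800)    # gives 34200
#   mpiexec ... dirloop_sse_mpi -T $TLIMIT --checkpoint-time 7200 ssejobfile.in.xml
```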


LSF and LAM/MPI:

#!/bin/csh
mpirun -srun dirloop_sse_mpi -T 34200 --checkpoint-time 7200 ssejobfile.in.xml


Example script for the Hreidar cluster

Below is an example script "submitscript" to run a parallel application on the Hreidar cluster. Make sure the script is executable (e.g. chmod 775 submitscript). It can be submitted to the job queue with, for example,

bsub -n8 -W90 ./submitscript

In this example the program will run on 8 processors with a maximum time limit of 90 minutes.

#! /bin/sh
#
#  example submitscript
#
export HOST_LIST=host_list.$$
echo $LSB_HOSTS|tr ' ' '\n'|sed 's/.asgard.net//' > $HOST_LIST
nprocs=`wc $HOST_LIST|awk '{print $1}'`
sort $HOST_LIST | uniq > ${HOST_LIST}_lam
lamboot ${HOST_LIST}_lam 
mpirun -np $nprocs dirloop_sse_mpi -T5000 --checkpoint-time 1000 parm4a.in.xml
wipe ${HOST_LIST}_lam
rm $HOST_LIST ${HOST_LIST}_lam