Difference between revisions of "Documentation:Running"

From ALPS

Revision as of 09:25, 5 April 2007

When you start an application, the ALPS scheduler manages your simulation. It writes checkpoints at regular intervals, so that you can resume the simulation later if it terminates before finishing.

For some typical sessions please check the Tutorials.

Starting a simulation

After creating a job file with parameter2xml, you can start the simulation by typing the name of the application followed by the name of the job file. For example:

./dirloop_sse -T 3600 job.in.xml

This runs the code dirloop_sse for 3600 seconds (the time limit set with -T) on the input file job.in.xml. You can learn more about the additional command line options by going through this tutorial.

After the first checkpoint (by default after 30 minutes of runtime), the files job.task*.out.xml and job.task*.out.run1 are written. The job.task*.out.xml files contain the results in XML format: intermediate results while the simulation is still running, and the final results once it has finished. The job.task*.out.run1 files contain the simulation state in binary format.

Re-running a simulation

In some situations (for example, if you have to shut down your computer), your job may stop before it has finished. The current state of the simulation, however, is stored in the job.task*.out.run1 files, so ALPS can restart the simulation and continue from where it stopped. To restart, pass job.out.xml instead of job.in.xml on the command line. For example:

./dirloop_sse job.out.xml
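
The choice between job.in.xml and job.out.xml can be scripted. The following is a minimal sketch, not part of ALPS: the function name pick_jobfile is hypothetical, and it assumes that job.out.xml exists only after an earlier run has written output.

```shell
#!/bin/sh
# Hypothetical helper (not an ALPS tool): print the job file to pass to the
# application. Assumption: <prefix>.out.xml exists only if an earlier run
# has already written output, in which case we resume from it.
pick_jobfile() {
    prefix=$1
    if [ -f "${prefix}.out.xml" ]; then
        echo "${prefix}.out.xml"    # resume from the stored state
    else
        echo "${prefix}.in.xml"     # no earlier run found: start fresh
    fi
}

# Usage (commented out because it needs the ALPS binary):
# ./dirloop_sse -T 3600 "$(pick_jobfile job)"
```

With such a helper the same command line can be used for the first start and for every restart.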

Extending a simulation

Even if your simulation has finished, it may turn out that the number of sweeps you chose was not sufficient for a good result. In such cases you can edit the job.task*.out.xml files: increase the number of sweeps, and set the simulation status from finished back to running.

After that, just proceed in the same manner as for re-running a simulation.
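
Because extending a simulation means editing the output files by hand, it may be worth keeping copies of them first. The following is a minimal sketch under that assumption; the function name backup_task_files and the .bak suffix are hypothetical and not part of ALPS.

```shell
#!/bin/sh
# Hypothetical helper (not an ALPS tool): copy every <prefix>.task*.out.xml
# file to <file>.bak before it is edited by hand, and print each backup name.
backup_task_files() {
    prefix=$1
    for f in "${prefix}".task*.out.xml; do
        [ -f "$f" ] || continue    # the pattern matched no file
        cp -p "$f" "$f.bak"        # -p keeps the original timestamps
        echo "$f.bak"
    done
}

# Usage:
# backup_task_files job
```

If an edit goes wrong, the original file can be restored by copying the .bak file back over it.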

Running a simulation on a high performance computer

How you run a simulation on a cluster depends strongly on the batch submission system in use. Typically you will need to write a small submission script containing the same commands that you would use on a workstation. Below are some example batch files for different systems:

PBS and LAM/MPI:

#!/bin/sh

### Job name
#PBS -N anssejob

### Declare job non-rerunnable
#PBS -r n

### Mail to user
#PBS -m ae

### Queue name
#PBS -q thequeue

### Wall clock time required. We set it to 10 hours
#PBS -l walltime=10:00:00

### Number of nodes. We use 8 nodes, with 2 cpus per node
#PBS -l nodes=8:ppn=2

### Output some information on allocated cpus/nodes
echo $PBS_JOBID : `wc -l < $PBS_NODEFILE` CPUs allocated: `cat $PBS_NODEFILE`
cd $PBS_O_WORKDIR
NPROCS=`wc -l < $PBS_NODEFILE`

### Execute job using mpi 
### This job will run for 9:30 hours (34200 seconds), and dump checkpoints every 2 hours (7200 seconds).
mpiexec -boot -machinefile $PBS_NODEFILE dirloop_sse_mpi -T 34200 --checkpoint-time 7200 ssejobfile.in.xml > ${PBS_JOBNAME}.`echo ${PBS_JOBID} | sed "s/.output//"`
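
The PBS script above requests 10 hours of wall clock time but passes -T 34200 (9.5 hours) to the application, presumably so that the run can stop cleanly before the batch system ends the job. The arithmetic can be sketched as follows; the function name walltime_to_limit and the default margin of 1800 seconds are assumptions made for illustration, not ALPS or PBS features.

```shell
#!/bin/sh
# Hypothetical helper: convert a walltime given as HH:MM:SS into seconds and
# subtract a safety margin (default 1800 s, an assumption), giving a value
# that could be passed to -T.
walltime_to_limit() {
    echo "$1" | awk -F: -v margin="${2:-1800}" \
        '{ print $1 * 3600 + $2 * 60 + $3 - margin }'
}

# Usage: 10 hours of walltime with a 30 minute margin
walltime_to_limit 10:00:00 1800    # prints 34200
```

Using awk for the arithmetic avoids the problem that values such as 08 would be read as invalid octal numbers by shell arithmetic.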


LSF and LAM/MPI:

#!/bin/csh
mpirun -srun dirloop_sse_mpi -T 34200 --checkpoint-time 7200 ssejobfile.in.xml