The LoadLeveler Batch System
The login node "vip.rzg.mpg.de" of the Power6 system is intended mainly for editing and compiling your parallel programs. Interactive usage of "poe" is not allowed on the login node "vip.rzg.mpg.de". To run test or production jobs, submit them to the LoadLeveler batch system, which will find and allocate the resources required for your job (e.g. the compute nodes to run your job on).
Short test jobs (< 15 min) with 4, 8 or 16 CPUs will run on a dedicated Power6 node with short turnaround times.
By default, the job run limit is set to 1 on "vip". If your batch jobs can run independently of each other, your job run limit can be raised on request.
In principle, you can run your old job scripts from the Regatta or Power5 cluster on the Power6 cluster without change, except that you have to omit the statement "# @ requirements = (Arch == "R6000") && (OpSys >= "AIX53")". (The architecture of the new system is "Power6", not "R6000".)
Your existing job scripts will run in "Single Thread" (ST) mode on the Power6 nodes. We recommend starting with this mode.
Later on, you may try the "Simultaneous Multithreading" (SMT) mode, which can increase the performance of your application by up to 20%. In SMT mode, you have to increase tasks_per_node from 32 to 64. Please be aware that with 64 tasks per node each process gets only half as much memory by default. If you need more memory per process, you have to specify it in the variable "ConsumableMemory". On Power6, 78 compute nodes with 128 GB of real memory and 2 compute nodes with 256 GB of real memory are available. You can therefore specify "ConsumableMemory(1600mb)" for SMT MPI jobs on 1 to 64 nodes with about 100 GB of user memory each, and "ConsumableMemory(3600mb)" for SMT MPI jobs on 1 to 2 nodes with 225 GB of user memory each.
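The memory figures quoted above follow from multiplying the per-task memory request by the number of tasks per node. A minimal shell sketch of that arithmetic, using only the values given above (the variable names are illustrative, not LoadLeveler keywords):

```sh
#!/bin/sh
# In SMT mode, 64 tasks share the user memory of one node.
tasks_per_node=64

# Standard nodes: ConsumableMemory(1600mb) per task
std_total_mb=$((tasks_per_node * 1600))
std_total_gb=$((std_total_mb / 1024))

# Large-memory nodes: ConsumableMemory(3600mb) per task
big_total_mb=$((tasks_per_node * 3600))
big_total_gb=$((big_total_mb / 1024))

echo "64 tasks x 1600 MB = ${std_total_mb} MB (${std_total_gb} GB)"
echo "64 tasks x 3600 MB = ${big_total_mb} MB (${big_total_gb} GB)"
```

This gives 100 GB and 225 GB per node, matching the user memory stated for the two node types.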
For best performance of your jobs on the Power6 cluster, we recommend that you do NOT use the MPI environment variable "MP_TASK_AFFINITY=MCM" in your job scripts. The appropriate task affinity is set by the LoadLeveler submit filter.
For detailed information about LoadLeveler, please see IBM's manual about Using and Administering IBM LoadLeveler for AIX.
The most important LoadLeveler commands are:
- llsubmit: Submit a job script for execution.
- llq: Check the status of your job(s).
- llcancel: Cancel a job.
- llclass: List the available batch classes.
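A typical sequence of these commands might look like the following sketch. The file name job.cmd and the variable job_id are placeholders chosen for illustration; the commands only work on a system where LoadLeveler is installed.

```sh
# Submit the job script (hypothetical file name).
llsubmit job.cmd

# List your own jobs and their current state.
llq -u "$USER"

# Cancel a job. job_id is assumed to hold the job ID
# shown in the "Id" column of the llq output.
llcancel "$job_id"

# Show the batch classes defined on the system.
llclass
```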
Since the upgrade to LoadLeveler 4.1, the graphical user interface xloadl is no longer supported by IBM.
Sample batch job script: sample script
Notes on job scripts:
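A minimal sketch of what a pure MPI job script in ST mode might look like is given below. The job name, the output file names, the wall clock limit and the executable my_mpi_program are assumptions made for illustration. Site-specific keywords (for example a batch class) are left out because they are not stated here.

```sh
# @ shell = /bin/ksh
# @ job_name = mpi_test
# @ job_type = parallel
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).err
# @ node = 2
# @ tasks_per_node = 32
# @ resources = ConsumableCpus(1)
# @ wall_clock_limit = 00:15:00
# @ notification = never
# @ queue

# Start the MPI program under poe (hypothetical executable name).
poe ./my_mpi_program
```

With 2 nodes and 32 tasks per node, this sketch would start 64 MPI processes.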
The variable
# @ node = <nr. of nodes>
gives the number of Power6 nodes that your program will use.
The variable
# @ tasks_per_node = <nr. of cpus>
specifies the number of MPI processes per node. If you are using OpenMP, you have to set this variable to 1. The parameter @tasks_per_node cannot be greater than 64 because one Power6 node has 32 processors with 2 hardware threads each, and thus 64 logical CPUs in "Simultaneous Multithreading" (SMT) mode.
The variable
# @ resources = ConsumableCpus(<nr. of threads>)
specifies the number of threads if you are using OpenMP. In the case of MPI, you have to set this variable to 1.
Along with ConsumableCpus(xx) you can specify ConsumableMemory(yyyy) as the memory (in MB) that your job needs per MPI task.
The expression
@tasks_per_node * ConsumableCpus()
may not exceed 64.
The expression
@node * @tasks_per_node * ConsumableCpus()
gives the total number of CPUs that your job will use.
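The two expressions above can be checked before a job is submitted. The following sketch is a hypothetical helper (the function name check_layout is not part of LoadLeveler) that rejects layouts with more than 64 logical CPUs per node and prints the total CPU count otherwise:

```sh
#!/bin/sh
# Arguments: <node> <tasks_per_node> <ConsumableCpus>
check_layout() {
    node=$1
    tasks_per_node=$2
    consumable_cpus=$3
    per_node=$((tasks_per_node * consumable_cpus))
    if [ "$per_node" -gt 64 ]; then
        echo "invalid: $per_node CPUs per node exceeds 64"
        return 1
    fi
    echo "ok: $((node * per_node)) CPUs in total"
}

check_layout 2 32 1           # pure MPI, ST mode on 2 nodes
check_layout 2 64 1           # pure MPI, SMT mode on 2 nodes
check_layout 1 1 32           # pure OpenMP with 32 threads on 1 node
check_layout 1 64 2 || true   # rejected: 128 CPUs per node
```

The first three layouts are accepted with 64, 128 and 32 CPUs in total; the last one is rejected because 64 tasks with 2 CPUs each would need 128 CPUs on one node.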
Monitoring a batch job:
You can monitor the CPU and memory usage of your running batch job with the octop tool, e.g.:
octop -c -n <node name>
octop -t -n <node name>
