===============================================================================
Readme file for: IBM® Platform LSF
Product/Component Release: 9.1.2
Update name: Enhancement pack for Smart CPU Usage Enforcement
Publication date: 09 June 2014
Last modified: 09 June 2014
Abstract: This enhancement introduces a way to kill a job that uses the
most CPU when the host's CPU usage has reached a configured threshold.
===============================================================================
=========================
CONTENTS
=========================
1. About this enhancement
2. Supported operating systems
3. Products or components affected
4. Installation and Configuration
5. Copyright
=====================
1. About this enhancement
=====================
This enhancement allows the system to kill the job using the most CPU if the
average logic CPU r15m value and the UT value both reach a configured threshold
on the host. This allows other jobs on the host to run smoothly.
A job is considered the worst CPU offending job on a host if it is using the
most CPU (system time + user time) for an average assigned slot during the
check period.
When a job is killed as the worst CPU offending job, the exit reason is the
same as when a job reaches its normal CPU limit:
"job killed after reaching LSF CPU usage limit"
Smart CPU Usage Enforcement is treated as a special case of the normal CPU
limit function, applied at the host level.
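The worst CPU offending job definition above can be sketched as follows. This
is only an illustration of the stated rule (system time + user time, averaged
over the job's assigned slots); the JobUsage record, its field names, and the
sample numbers are hypothetical and do not reflect how sbatchd stores or
collects job data.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    # Hypothetical record; field names are illustrative, not LSF structures.
    job_id: int
    user_time: float    # CPU seconds in user mode during the check period
    system_time: float  # CPU seconds in system mode during the check period
    slots: int          # slots assigned to the job on this host


def cpu_per_slot(job: JobUsage) -> float:
    """CPU time (system + user) averaged over the job's assigned slots."""
    return (job.system_time + job.user_time) / job.slots


def worst_cpu_offender(jobs):
    """Return the job with the highest per-slot CPU time, or None if no jobs."""
    return max(jobs, key=cpu_per_slot, default=None)


# Made-up sample data for one host and one check period.
jobs = [
    JobUsage(job_id=101, user_time=40.0, system_time=5.0, slots=1),
    JobUsage(job_id=102, user_time=90.0, system_time=10.0, slots=4),
    JobUsage(job_id=103, user_time=20.0, system_time=2.0, slots=2),
]
worst = worst_cpu_offender(jobs)
print(worst.job_id, cpu_per_slot(worst))
```

In this sample, job 102 uses the most CPU in total, but job 101 uses the most
CPU per assigned slot, so job 101 is the one selected.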
This solution is configured through a new configuration parameter in lsf.conf:
LSB_CPU_USAGE_ENF_CONTROL=<Average Logic CPU r15m Threshold>:<UT Threshold>:<Check Interval>
Description:
1) Average Logic CPU r15m Threshold:
A threshold value for the maximum limit for the quotient of the r15m value
reported by the lsload command for the host and the host's logic CPU count.
This quotient is the average CPU queue length during the last 15 minutes for
one logic CPU on the host.
It must be a floating-point number greater than or equal to zero (0). For
example, 7.8, 2.1, 0.9, and so on.
2) UT Threshold:
A threshold for the maximum limit of the host lsload command's UT value. The UT
value is the CPU utilization exponentially averaged over the last minute,
between 0 and 1.
It must be a floating-point number between 0 and 1. For example, 0.4, 0.5,
0.24, and so on.
3) Check Interval:
The minimum time, in seconds, that must pass between two consecutive checks of
the host's r15m and UT information.
This value must not be less than the value of SBD_SLEEP_TIME. For example, 20,
40, 60, and so on.
4) The host is considered to be in CPU overload when <Average Logic CPU r15m
Threshold> and <UT Threshold> have both been reached.
5) This parameter does not affect jobs running across multiple hosts.
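The field rules in items 1) to 4) can be summarized in a short sketch. This is
not LSF code: the function names are hypothetical, the fields are assumed to be
written as plain numbers, the check interval is assumed to be a whole number of
seconds, "reached" is read as "greater than or equal to", and the
SBD_SLEEP_TIME value is passed in by the caller rather than read from the LSF
configuration.

```python
def parse_cpu_usage_enf_control(value: str, sbd_sleep_time: int):
    """Split '<r15m threshold>:<UT threshold>:<check interval>' and
    check each field against the ranges given in the description."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError("expected three colon-separated fields")

    r15m_threshold = float(parts[0])
    ut_threshold = float(parts[1])
    check_interval = int(parts[2])  # assumption: whole seconds

    if r15m_threshold < 0:
        raise ValueError("r15m threshold must be >= 0")
    if not 0 <= ut_threshold <= 1:
        raise ValueError("UT threshold must be between 0 and 1")
    if check_interval < sbd_sleep_time:
        raise ValueError("check interval must not be less than SBD_SLEEP_TIME")
    return r15m_threshold, ut_threshold, check_interval


def is_cpu_overloaded(avg_r15m, ut, r15m_threshold, ut_threshold):
    """Item 4): overload requires BOTH thresholds to be reached."""
    return avg_r15m >= r15m_threshold and ut >= ut_threshold


# Illustrative values only; 30 stands in for the site's SBD_SLEEP_TIME.
r15m_thr, ut_thr, interval = parse_cpu_usage_enf_control("0.9:0.5:60", 30)
print(is_cpu_overloaded(1.2, 0.8, r15m_thr, ut_thr))  # both reached
print(is_cpu_overloaded(1.2, 0.3, r15m_thr, ut_thr))  # only r15m reached
```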
Default:
Not defined
Example:
The following example shows how to calculate the average logic CPU r15m value
and the UT value.
-----------------------------------------------------------------
lsload
HOST_NAME   status  r15s   r1m  r15m    ut    pg    ls    it   tmp    swp    mem
hostA           ok   0.0   0.1   1.6   34%   0.0     1     7   91G  11.7G  22.6G
lshosts -l
HOST_NAME: hostA
type     model        cpuf  ncpus  ndisks  maxmem  maxswp  maxtmp   rexpri  server  nprocs  ncores  nthreads
X86_64   Intel_EM64T  60.0  8      1       23.9G   11.7G   222589M  0       Yes     2       4       2
RESOURCES: (mg)
RUN_WINDOWS: (always open)
LOAD_THRESHOLDS:
r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
-     3.5  -     -   -   -   -   -   -    -    -
-----------------------------------------------------------------
The average logic CPU r15m is defined as <lsload r15m value>/<logic CPU count>
The <lsload r15m value> is the content of the "r15m" column displayed using the
"lsload" command.
The <logic CPU count> is the product of the "nprocs" column, the "ncores"
column, and the "nthreads" column displayed using the "lshosts -l" command.
The UT is the content of the "ut" column displayed using the "lsload" command.
With the data above, you can calculate the following:
Current average logic CPU r15m value = 1.6/(2*4*2) = 0.1
Current UT value = 0.34 (34%)
Therefore, with a check interval of 30 (30 seconds), and using the above data,
the parameter would be set as follows:
LSB_CPU_USAGE_ENF_CONTROL=<0.1>:<0.34>:<30>
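The arithmetic in this example can be reproduced with a few lines of Python.
This is only a worked check of the numbers shown above; the values are copied
by hand from the sample lsload and lshosts -l output, and the variable names
are illustrative.

```python
# Values copied by hand from the sample output above.
r15m = 1.6                           # "r15m" column of lsload
ut_percent = 34                      # "ut" column of lsload, shown as 34%
nprocs, ncores, nthreads = 2, 4, 2   # columns of lshosts -l

# Logic CPU count is the product of nprocs, ncores, and nthreads.
logic_cpu_count = nprocs * ncores * nthreads

# Average logic CPU r15m = <lsload r15m value> / <logic CPU count>.
avg_r15m = r15m / logic_cpu_count

# UT expressed as a fraction between 0 and 1.
ut = ut_percent / 100

print(f"logic CPU count        = {logic_cpu_count}")
print(f"average logic CPU r15m = {avg_r15m:.2f}")
print(f"UT                     = {ut:.2f}")
```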
=========================
2. Supported operating systems
=========================
linux2.6-glibc2.3-x86_64
aix-64
=========================
3. Products or components affected
=========================
Affected components include:
LSF/sbatchd
=========================
4. Installation and Configuration
=========================
4.1 Before installation:
(LSF_TOP=Full path to the top-level installation directory of LSF.)
1) Log on to the LSF master host as root.
2) Set your environment:
- For csh or tcsh: % source LSF_TOP/conf/cshrc.lsf
- For sh, ksh, or bash: $ . LSF_TOP/conf/profile.lsf
4.2 Installation steps:
1) Go to the patch install directory: cd $LSF_ENVDIR/../9.1/install/
2) Copy the patch file to the install directory $LSF_ENVDIR/../9.1/install/
3) Run patchinstall: ./patchinstall <patch>
4.3 After installation:
1) Log on to the LSF master host as root.
2) Add the configuration key as described above to lsf.conf:
LSB_CPU_USAGE_ENF_CONTROL=<Average Logic CPU r15m Threshold>:<UT Threshold>:<Check Interval>
3) Run "badmin hrestart all"
4.4 Uninstallation:
1) Log on to the LSF master host as root.
2) Run ./patchinstall -r <patch>
3) Run "badmin hrestart all"
=========================
5. Copyright
=========================
© Copyright IBM Corporation 2014
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the
Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml