===============================================================================
Readme file for: IBM® Platform LSF
Product/Component Release: 9.1.2
Update name: Enhancement pack for Smart CPU Usage Enforcement
Publication date: 09 June 2014
Last modified: 09 June 2014

Abstract:  This enhancement introduces a way to kill the job that uses the most CPU when the host's CPU usage
has reached a configured threshold.
===============================================================================

=========================
CONTENTS
=========================
1. About this enhancement
2. Supported operating systems
3. Products or components affected
4. Installation and Configuration
5. Copyright


=====================
1. About this enhancement
=====================

This enhancement allows the system to kill the job that is using the most CPU when the average logical CPU r15m value and the
UT value have both reached their configured thresholds on the host. This allows the other jobs on the host to run smoothly.


A job is considered the worst CPU-offending job on a host if, during the check period, it uses the most CPU time
(system time + user time) per assigned slot on average.
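
The selection rule above can be sketched as follows. This is only an illustration of the stated definition, not the
sbatchd implementation: the JobSample structure, its field names, and the sample numbers are hypothetical, and each job
is assumed to have at least one assigned slot.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobSample:
    """CPU accounting for one job over a single check period (hypothetical structure)."""
    job_id: int
    user_time: float    # user CPU seconds consumed during the check period
    system_time: float  # system CPU seconds consumed during the check period
    slots: int          # slots assigned to the job on this host (assumed >= 1)


def cpu_per_slot(job: JobSample) -> float:
    """Return (system time + user time) divided by the number of assigned slots."""
    return (job.system_time + job.user_time) / job.slots


def worst_offender(jobs: List[JobSample]) -> Optional[JobSample]:
    """Return the job with the highest per-slot CPU time, or None if there are no jobs."""
    return max(jobs, key=cpu_per_slot, default=None)


jobs = [
    JobSample(job_id=101, user_time=50.0, system_time=10.0, slots=4),  # 15.0 per slot
    JobSample(job_id=102, user_time=25.0, system_time=5.0, slots=1),   # 30.0 per slot
    JobSample(job_id=103, user_time=8.0, system_time=2.0, slots=2),    # 5.0 per slot
]
print(worst_offender(jobs).job_id)  # prints 102
```

Note that job 101 uses the most CPU time in total, but job 102 uses the most per assigned slot, so job 102 is the one
selected under this definition.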

 

When a job is killed as the worst CPU-offending job, the exit reason is the same as when a job reaches its normal CPU limit:
"job killed after reaching LSF CPU usage limit"

Smart CPU Usage Enforcement is treated as a special case of the normal CPU limit function, applied at the host level.

 

This solution is configured through a new configuration parameter in lsf.conf:

LSB_CPU_USAGE_ENF_CONTROL=<Average Logic CPU r15m Threshold>:<UT Threshold>:<Check Interval>

Description:

1) Average Logic CPU r15m Threshold:
The threshold for the quotient of the host's lsload r15m value and the number of logical CPUs on the host, that is, the
average CPU queue length per logical CPU over the last 15 minutes.

It must be a floating-point number greater than or equal to zero (0). For example, 7.8, 2.1, 0.9, and so on.

 

2) UT Threshold:
The threshold for the host's lsload UT value. The UT value is the CPU utilization exponentially averaged over the last minute,
between 0 and 1.

It must be a floating-point number between 0 and 1. For example, 0.4, 0.5, 0.24, and so on.


3) Check Interval:
The minimum time, in seconds, between two consecutive checks of the host's r15m and UT values.

This value must not be less than the value of SBD_SLEEP_TIME. For example, 20, 40, 60, and so on.

4) The host is considered to be in CPU overload when <Average Logic CPU r15m Threshold> and <UT Threshold> have both been reached.

5) This parameter does not affect jobs running across multiple hosts.
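
The rules in items 1) to 3) can be checked before a value is added to lsf.conf. The following is a minimal sketch, not
part of LSF: the helper name parse_cpu_usage_enf_control is hypothetical, the value is assumed to be three plain numbers
separated by colons, the check interval is assumed to be a whole number of seconds, and the SBD_SLEEP_TIME value (15 here)
is illustrative and must be supplied by the caller.

```python
from typing import NamedTuple


class CpuEnforcement(NamedTuple):
    r15m_threshold: float  # average logical CPU r15m threshold, must be >= 0
    ut_threshold: float    # UT threshold, must be between 0 and 1
    check_interval: int    # seconds, must not be less than SBD_SLEEP_TIME


def parse_cpu_usage_enf_control(value: str, sbd_sleep_time: int) -> CpuEnforcement:
    """Split and validate a candidate LSB_CPU_USAGE_ENF_CONTROL value (hypothetical helper)."""
    parts = value.strip().split(":")
    if len(parts) != 3:
        raise ValueError("expected three colon-separated fields")
    r15m = float(parts[0])       # raises ValueError if not a number
    ut = float(parts[1])
    interval = int(parts[2])     # assumption: whole seconds
    if not r15m >= 0:
        raise ValueError("r15m threshold must be greater than or equal to 0")
    if not 0 <= ut <= 1:
        raise ValueError("UT threshold must be between 0 and 1")
    if interval < sbd_sleep_time:
        raise ValueError("check interval must not be less than SBD_SLEEP_TIME")
    return CpuEnforcement(r15m, ut, interval)


# Illustrative SBD_SLEEP_TIME of 15 seconds; use the value configured in your cluster.
print(parse_cpu_usage_enf_control("0.9:0.5:60", sbd_sleep_time=15))
```

A value such as "0.9:1.5:60" (UT above 1) or "0.9:0.5:10" (interval below the illustrative SBD_SLEEP_TIME) raises
ValueError in this sketch.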

 

Default:

Not defined

 

Example:

The following example shows how to calculate the average logical CPU r15m value and the UT value.

-----------------------------------------------------------------

lsload
HOST_NAME       status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp   mem
hostA               ok   0.0   0.1   1.6  34%   0.0   1     7   91G 11.7G 22.6G

lshosts -l
HOST_NAME:  hostA
type    model        cpuf  ncpus  ndisks  maxmem  maxswp   maxtmp  rexpri  server  nprocs  ncores  nthreads
X86_64  Intel_EM64T  60.0      8       1   23.9G   11.7G  222589M       0     Yes       2       4         2

RESOURCES: (mg)
RUN_WINDOWS:  (always open)

LOAD_THRESHOLDS:
  r15s   r1m  r15m   ut    pg    io   ls   it   tmp   swp   mem
     -   3.5     -    -     -     -    -    -     -     -     -
-----------------------------------------------------------------

The average logical CPU r15m is defined as <lsload r15m value>/<logical CPU count>.

The <lsload r15m value> is the content of the "r15m" column displayed by the "lsload" command.

The <logical CPU count> is the product of the "nprocs", "ncores", and "nthreads" columns displayed by the "lshosts -l" command.


The UT is the content of the "ut" column displayed by the "lsload" command, expressed as a fraction between 0 and 1.
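
As an illustration of these definitions, the following sketch extracts the relevant columns from the sample lsload and
lshosts -l lines in this example and computes both values. The parse_row helper is hypothetical, and the sketch assumes
that the header line and the data line contain the same number of whitespace-separated fields, as they do in the sample.

```python
def parse_row(header: str, row: str) -> dict:
    """Pair whitespace-separated column names with the values on the data line."""
    names, values = header.split(), row.split()
    if len(names) != len(values):
        raise ValueError("header and row have different column counts")
    return dict(zip(names, values))


# Header and data lines copied from the sample lsload output for hostA.
LSLOAD_HEADER = "HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem"
LSLOAD_ROW = "hostA ok 0.0 0.1 1.6 34% 0.0 1 7 91G 11.7G 22.6G"

# Header and data lines copied from the sample lshosts -l output for hostA.
LSHOSTS_HEADER = ("type model cpuf ncpus ndisks maxmem maxswp maxtmp "
                  "rexpri server nprocs ncores nthreads")
LSHOSTS_ROW = "X86_64 Intel_EM64T 60.0 8 1 23.9G 11.7G 222589M 0 Yes 2 4 2"

load = parse_row(LSLOAD_HEADER, LSLOAD_ROW)
host = parse_row(LSHOSTS_HEADER, LSHOSTS_ROW)

# Logical CPU count = nprocs * ncores * nthreads
logical_cpus = int(host["nprocs"]) * int(host["ncores"]) * int(host["nthreads"])

# Average logical CPU r15m = lsload r15m value / logical CPU count
avg_r15m = float(load["r15m"]) / logical_cpus

# lsload prints ut as a percentage; convert it to a fraction between 0 and 1
ut = float(load["ut"].rstrip("%")) / 100

print(logical_cpus, avg_r15m, ut)
```

For the sample data this gives 16 logical CPUs, an average logical CPU r15m of 0.1, and a UT of 0.34.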

 

With the data above, you can calculate the following:

Current average logical CPU r15m value = 1.6/(2*4*2) = 0.1

Current UT value = 0.34 (34%)

Therefore, with a check interval of 30 seconds and the above values used as thresholds, the parameter would be set as follows:

LSB_CPU_USAGE_ENF_CONTROL=0.1:0.34:30
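
To illustrate how the two thresholds combine, the following sketch evaluates the overload condition for hostA with the
numbers from this example. It is an interpretation rather than the sbatchd logic: the function name is hypothetical, and
"reached" is assumed to mean greater than or equal to the threshold.

```python
def is_cpu_overloaded(r15m: float, logical_cpus: int, ut: float,
                      r15m_threshold: float, ut_threshold: float) -> bool:
    """Return True when both thresholds have been reached (assumed to mean >=)."""
    avg_r15m = r15m / logical_cpus  # average logical CPU r15m
    return avg_r15m >= r15m_threshold and ut >= ut_threshold


# hostA: r15m = 1.6, logical CPUs = 2*4*2 = 16, UT = 0.34
# Thresholds from the example setting: 0.1 and 0.34
print(is_cpu_overloaded(1.6, 2 * 4 * 2, 0.34, 0.1, 0.34))  # True: both thresholds reached

# With a higher r15m threshold, only the UT threshold is reached
print(is_cpu_overloaded(1.6, 2 * 4 * 2, 0.34, 0.5, 0.34))  # False
```

Because the thresholds in the example equal the current values on hostA, the host would be treated as overloaded at the
next check under this assumption.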

 

=========================
2. Supported operating systems
=========================
linux2.6-glibc2.3-x86_64

aix-64

=========================
3. Products or components affected
=========================

Affected components include:

LSF/sbatchd

=========================
4. Installation and Configuration
=========================

4.1  Before installation:

 

      (LSF_TOP is the full path to the top-level installation directory of LSF.)

      1) Log on to the LSF master host as root.

      2) Set your environment:

         -  For csh or tcsh: % source LSF_TOP/conf/cshrc.lsf

         -  For sh, ksh, or bash: $ . LSF_TOP/conf/profile.lsf

 

4.2  Installation steps:

 

     1) Go to the patch install directory: cd $LSF_ENVDIR/../9.1/install/

     2) Copy the patch file to the install directory $LSF_ENVDIR/../9.1/install/

     3) Run patchinstall: ./patchinstall <patch>

 

4.3  After installation:

 

    1) Log on to the LSF master host as root.

    2) Add the configuration parameter described above to lsf.conf:

       LSB_CPU_USAGE_ENF_CONTROL=<Average Logic CPU r15m Threshold>:<UT Threshold>:<Check Interval>

    3) Run "badmin hrestart all" to restart sbatchd on all hosts.

 

4.4  Uninstallation:

 

    1) Log on to the LSF master host as root.

    2) From the patch install directory ($LSF_ENVDIR/../9.1/install/), run: ./patchinstall -r <patch>

    3) Run "badmin hrestart all" to restart sbatchd on all hosts.

=========================
5. Copyright
=========================

© Copyright IBM Corporation 2014

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM®, the IBM logo, and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide.

Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the
Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml