IBM Spectrum LSF 10.1 Fix 600061 Readme File
Abstract
P104014. This solution enables LSF to support dynamic or static scheduling for NVIDIA Multi-Instance GPU (MIG).
Description
Readme documentation for IBM Spectrum LSF 10.1 Fix 600061 including installation-related instructions, prerequisites and co-requisites, and list of fixes.
This solution supports the following functions:
1. Introduce LSF_MANAGE_MIG=Y|N in the lsf.conf file to control dynamic or static MIG scheduling.
2. Introduce the new keyword "mig" to the bsub -gpu option to submit MIG jobs.
3. Introduce the new keyword "mig" to the -R resource requirement string in the rusage section.
4. The lshosts -gpu command option marks the MIG GPUs.
5. The bhosts command shows the MIG allocation on each host.
6. The bjobs, bhist, and bacct commands show the MIG requirement and allocation for each MIG job.
7. MIG allocation is also displayed in the bhosts -o -json and bjobs -o -json command options.
Note: To use this feature, the NVIDIA A100 GPU must first be set to MIG mode.
1. Configuration
In the lsf.conf file, set the following parameters:
LSF_GPU_RESOURCE_IGNORE=Y
LSF_GPU_AUTOCONFIG=Y
LSB_GPU_NEW_SYNTAX=extend
LSF_MANAGE_MIG=Y|N
2. Introduce a new parameter LSF_MANAGE_MIG in the lsf.conf file:
Syntax
LSF_MANAGE_MIG = Y | y | N | n
Description
If LSF_MANAGE_MIG=Y, LSF dynamically creates GPU instances (GI) and compute instances (CI) on each host, and LSF controls the MIG configuration of each host.
If LSF_MANAGE_MIG=N, LSF allocates the existing GI and CI based on the MIG configuration of each host and dispatches jobs to the MIG hosts. LSF does not create or destroy GI or CI on the MIG hosts.
To change the value of LSF_MANAGE_MIG, first close all MIG hosts with the badmin hclose command so that no new MIG jobs are dispatched, wait for the running MIG jobs to finish, and then restart LSF.
Default
N
Notes:
If LSF_MANAGE_MIG=Y is set, do not create or destroy MIG devices manually outside of LSF.
If LSF_MANAGE_MIG=N is set and you want to change the MIG devices on a host, first close the target host with the badmin hclose command so that no new MIG jobs are dispatched to it. After the running MIG jobs on the target host finish, destroy the existing MIG devices and create the new MIG devices.
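The two procedures above (changing LSF_MANAGE_MIG, and changing MIG devices on a host when LSF_MANAGE_MIG=N) can be sketched as a dry-run planner that only builds the list of steps and runs nothing. This is a minimal sketch under several assumptions that are not stated in this readme: that "restart LSF" means the restart sequence listed under "After installation", that lsf.conf is edited after the jobs drain, that hosts are reopened with badmin hopen at the end, and that MIG devices are destroyed and created with the nvidia-smi mig subcommands. The function names, host names, and the profile list passed to -cgi are hypothetical; valid profile IDs for a given GPU can be listed with nvidia-smi mig -lgip.

```python
def restart_lsf_commands():
    """Restart sequence copied from the "After installation" steps of this readme.

    Assumption: this is the sequence that "restart LSF" refers to.
    """
    return [
        "lsadmin limrestart all",
        "lsadmin resrestart all",
        "badmin hrestart all",
        "badmin mbdrestart -s",
    ]


def plan_manage_mig_change(mig_hosts, new_value):
    """Return the ordered steps for switching LSF_MANAGE_MIG to new_value."""
    if new_value not in ("Y", "y", "N", "n"):
        raise ValueError(f"LSF_MANAGE_MIG must be Y, y, N, or n, got {new_value!r}")
    # Close every MIG host so that no new MIG jobs are dispatched.
    plan = [f"badmin hclose {host}" for host in mig_hosts]
    plan.append("# wait until the running MIG jobs on these hosts have finished")
    # Assumption: lsf.conf is edited after the jobs drain and before the restart.
    plan.append(f"# edit lsf.conf: LSF_MANAGE_MIG={new_value}")
    plan.extend(restart_lsf_commands())
    # Assumption: the readme does not mention reopening; hopen undoes hclose.
    plan.extend(f"badmin hopen {host}" for host in mig_hosts)
    return plan


def plan_static_mig_change(host, profiles):
    """Return the ordered steps for changing MIG devices on one host (LSF_MANAGE_MIG=N)."""
    return [
        f"badmin hclose {host}",
        f"# wait until the running MIG jobs on {host} have finished",
        # The nvidia-smi commands are meant to be run on the target host itself.
        "nvidia-smi mig -dci",  # destroy the existing compute instances
        "nvidia-smi mig -dgi",  # destroy the existing GPU instances
        f"nvidia-smi mig -cgi {profiles} -C",  # create GIs with their default CIs
        f"badmin hopen {host}",  # assumption: reopen the host afterwards
    ]


if __name__ == "__main__":
    for step in plan_manage_mig_change(["hostA", "hostB"], "Y"):
        print(step)
```

Lines that start with # are manual steps rather than commands. Review the printed plan against your own cluster procedures before running any of it.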
3. Specify MIG device requirements for a job
Use the new "mig" keyword in the bsub -gpu option to submit a MIG job as follows:
bsub -gpu "num=2:mig=4/4" gpuApp
This job requires 2 MIG devices (num=2), each with a GPU instance of size 4 and a compute instance of size 4 (mig=4/4).
Alternatively, use the new "mig" keyword in the rusage section of the bsub -R resource requirement string as follows:
bsub -R "rusage[ngpus_physical=2:mig=4/4]" gpuApp
The format of the mig option is mig=GI_size/CI_size.
Valid GI_size values are: 1, 2, 3, 4, 7
The value of CI_size must be less than or equal to GI_size.
NVIDIA does not support the following GI_size/CI_size combinations: 4/3, 7/5, 7/6
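The rules above can be checked before a job is submitted. The following is a minimal sketch of such a check and is not part of LSF: the function names are hypothetical, and it assumes that both sizes are always given and that CI_size is at least 1, which this readme does not state explicitly.

```python
VALID_GI_SIZES = (1, 2, 3, 4, 7)
UNSUPPORTED_COMBINATIONS = {(4, 3), (7, 5), (7, 6)}


def parse_mig(spec):
    """Parse a 'GI_size/CI_size' string and return (gi, ci), or raise ValueError."""
    gi_text, sep, ci_text = spec.partition("/")
    if not sep or not gi_text.isdigit() or not ci_text.isdigit():
        raise ValueError(f"expected GI_size/CI_size, got {spec!r}")
    gi, ci = int(gi_text), int(ci_text)
    if gi not in VALID_GI_SIZES:
        raise ValueError(f"GI_size must be one of {VALID_GI_SIZES}, got {gi}")
    # Assumption: a compute instance needs at least one slice.
    if ci < 1 or ci > gi:
        raise ValueError(f"CI_size must be between 1 and GI_size ({gi}), got {ci}")
    if (gi, ci) in UNSUPPORTED_COMBINATIONS:
        raise ValueError(f"NVIDIA does not support the combination {gi}/{ci}")
    return gi, ci


def gpu_option(num, spec):
    """Build the argument for bsub -gpu after validating the mig value."""
    parse_mig(spec)
    return f"num={num}:mig={spec}"


def rusage_string(num, spec):
    """Build the argument for bsub -R after validating the mig value."""
    parse_mig(spec)
    return f"rusage[ngpus_physical={num}:mig={spec}]"


if __name__ == "__main__":
    print("bsub -gpu", f'"{gpu_option(2, "4/4")}"', "gpuApp")
    print("bsub -R", f'"{rusage_string(2, "4/4")}"', "gpuApp")
```

The two builders reproduce the strings used in the bsub examples in this section, so a wrapper script could reject a value such as 4/3 locally instead of waiting for the submission to fail.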
The mig keyword can be specified at the application and queue levels with the GPU_REQ or RES_REQ parameters. For example:
GPU_REQ=num=4:mig=4/1:mode=shared
or
RES_REQ=rusage[ngpus_physical=4:mig=2/1] span[gtile=4]
4. GPU fragmentation
MIG imposes restrictions on which GPU slices can be combined to create GPU instances. This can lead to fragmentation as GPU instances are created and destroyed.
If some jobs remain pending even though there are enough free GPU slices to create a GI for the job, the cause might be GPU fragmentation. For more information, refer to the NVIDIA documentation.
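To illustrate how fragmentation can leave a job pending, the following toy model checks whether a GI of a given size still fits on a partially used GPU. It is a simplified sketch and not how LSF or the NVIDIA driver decides placement. The placement table (allowed start slots and the number of slots each GI size occupies, over slot indices 0 to 7) is an assumption based on the layout NVIDIA publishes for the A100; the values for your hardware can be listed with nvidia-smi mig -lgipp. The function name is hypothetical.

```python
# Assumed placement table: GI_size -> (allowed start slots, slots occupied).
# Verify against the output of `nvidia-smi mig -lgipp` on your own GPU.
PLACEMENTS = {
    1: ((0, 1, 2, 3, 4, 5, 6), 1),
    2: ((0, 2, 4), 2),
    3: ((0, 4), 4),
    4: ((0,), 4),
    7: ((0,), 8),
}


def find_placement(occupied, gi_size):
    """Return the first start slot where a GI of gi_size fits, or None.

    occupied is the set of slot indices already used by existing GIs.
    """
    starts, width = PLACEMENTS[gi_size]
    for start in starts:
        needed = set(range(start, start + width))
        if not needed & occupied:
            return start
    return None


if __name__ == "__main__":
    # Three GIs of size 1 sit at slots 1, 3, and 5.
    occupied = {1, 3, 5}
    free = sorted(set(range(8)) - occupied)
    print("free slots:", free)
    print("size-1 GI fits at:", find_placement(occupied, 1))
    print("size-2 GI fits at:", find_placement(occupied, 2))
```

In this model, five slots are free but a GI of size 2 cannot be created, because every allowed start position (0, 2, 4) overlaps an occupied slot. A job that requests mig=2/1 would stay pending on such a GPU until one of the existing instances is destroyed.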
Out of Scope:
1. MIG planner scheduling
2. MIG preemption scheduling
3. MIG resource reservation
4. MIG for the MPS shared case
5. MIG gpack/block scheduling
6. MIG DCGM rusage collection
7. MIG controlled by cgroup
Readme File for: IBM® Spectrum LSF
Product/Component Release: 10.1
Update Name: Fix 600061
Fix ID: LSF-10.1-build600061
Publication Date: 21 December 2020
Last Modified Date: 21 December 2020
Contents
1. List of Fixes
2. Download Location
3. Product or Components Affected
4. System Requirements
5. Installation and Configuration
6. List of Files
7. Product Notifications
8. Copyright and Trademark Information
1. List of Fixes
P104014
2. Download Location
Download Fix 600061 from the following location: http://www.ibm.com/eserver/support/fixes/
3. Product or Components Affected
Affected product or components include:
LSF/lim
LSF/res
LSF/sbatchd
LSF/mbatchd
LSF/mbschd
LSF/bjobs
LSF/bmod
LSF/lshosts
LSF/lsload
LSF/bsub
LSF/bacct
LSF/bhist
LSF/bhosts
LSF/bswitch
LSF/lsbatch.h
LSF/lsf.h
LSF/liblsf.a
LSF/liblsf.so
LSF/libbat.a
LSF/libbat.so
4. System Requirements
linux2.6-glibc2.3-x86_64
linux3.10-glibc2.17-x86_64
5. Installation and Configuration
5.1 Before installation
(LSF_TOP=Full path to the top-level installation directory of LSF.)
1) Log on to the LSF master host as the LSF primary administrator
2) Set your environment:
- For csh or tcsh: % source LSF_TOP/conf/cshrc.lsf
- For sh, ksh, or bash: $ . LSF_TOP/conf/profile.lsf
5.2 Installation steps
1) Run badmin hclose all
2) Run badmin qinact all
3) Log on to the LSF master host as root and set the LSF cluster environment
4) Go to the patch install directory: cd $LSF_ENVDIR/../10.1/install/
5) Copy the patch file to the install directory $LSF_ENVDIR/../10.1/install/
6) Run patchinstall: ./patchinstall <patch>
5.3 After installation
1) Log on to the LSF master host as the LSF cluster primary administrator and set the LSF cluster environment
2) Run lsadmin limrestart all
3) Run lsadmin resrestart all
4) Run badmin hrestart all
5) Run badmin mbdrestart -s
6) Run badmin hopen all
7) Run badmin qact all
5.4 Uninstallation
1) Log on to the LSF master host as the LSF cluster primary administrator and set the LSF cluster environment
2) Run badmin hclose all
3) Run badmin qinact all
4) Log on to the LSF master host as root and set the LSF cluster environment
5) Go to the patch install directory: cd $LSF_ENVDIR/../10.1/install/
6) Run ./patchinstall -r <patch>
7) Log on to the LSF master host as the LSF cluster primary administrator and set the LSF cluster environment
8) Run lsadmin limrestart all
9) Run lsadmin resrestart all
10) Run badmin hrestart all
11) Run badmin mbdrestart -s
12) Run badmin hopen all
13) Run badmin qact all
6. List of Files
bsub bmod bjobs bacct bhist bhosts bswitch lshosts lsload lim res sbatchd mbatchd mbschd lsbatch.h lsf.h liblsf.a liblsf.so libbat.a libbat.so
7. Product Notifications
To receive information about product solution and patch updates automatically, subscribe to product notifications on the My notifications page (www.ibm.com/support/mynotifications) on the IBM Support website (support.ibm.com). You can edit your subscription settings to choose the types of information that you want to be notified about, for example, security bulletins, fixes, troubleshooting, and product enhancements or documentation changes.
8. Copyright and Trademark Information
©Copyright IBM Corporation 2020
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM®, the IBM logo, and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.