IBM Spectrum LSF 10.1 Fix 600061 Readme File 

Abstract

P104014. This solution enables LSF to support dynamic or static scheduling for NVIDIA Multi-Instance GPU (MIG).

Description

Readme documentation for IBM Spectrum LSF 10.1 Fix 600061 including installation-related instructions, prerequisites and co-requisites, and list of fixes.

This solution supports the following functions:

1.     Introduce LSF_MANAGE_MIG=Y|N in the lsf.conf file to control dynamic or static MIG scheduling.

2.     Introduce the new keyword "mig" to the bsub -gpu option to submit MIG jobs.

3.     Introduce the new keyword "mig" to the -R resource requirement string in the rusage section.

4.     The lshosts -gpu command option marks the MIG GPUs.

5.     The bhosts command shows the MIG allocation on each host.

6.     The bjobs, bhist, and bacct commands show the MIG requirement and allocation for each MIG job.

7.     MIG allocation is also displayed in the bhosts -o -json and bjobs -o -json command options.

 

Note: To use this feature, you must first enable MIG mode on the NVIDIA A100 GPU.

 

1.     Configuration
In the lsf.conf file, set the following parameters:
LSF_GPU_RESOURCE_IGNORE=Y
LSF_GPU_AUTOCONFIG=Y
LSB_GPU_NEW_SYNTAX=extend
LSF_MANAGE_MIG=Y|N
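
The following is a minimal sketch of how an administrator might check that an lsf.conf file carries the settings listed above. It is not part of the fix. The helper name check_mig_conf is hypothetical, and the check assumes that each parameter appears at the start of a line, exactly as written in this readme.

```shell
#!/bin/sh
# Sketch: verify that an lsf.conf file sets the parameters listed above.
# check_mig_conf is an illustrative helper, not an LSF command.
check_mig_conf() {
    conf="$1"
    rc=0
    for want in LSF_GPU_RESOURCE_IGNORE=Y LSF_GPU_AUTOCONFIG=Y \
                LSB_GPU_NEW_SYNTAX=extend; do
        # Look for the exact assignment at the start of a line.
        if ! grep -q "^${want}[[:space:]]*\$" "$conf"; then
            echo "missing or different: $want"
            rc=1
        fi
    done
    # LSF_MANAGE_MIG is optional; print it if present, else note the default.
    grep '^LSF_MANAGE_MIG=' "$conf" ||
        echo "LSF_MANAGE_MIG is not set (readme default: N)"
    return "$rc"
}
```

A typical call would be check_mig_conf "$LSF_ENVDIR/lsf.conf". The function returns a non-zero status when any of the three required settings is absent.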

2.     New LSF_MANAGE_MIG parameter in the lsf.conf file
Syntax
LSF_MANAGE_MIG=Y | y | N | n

                        

Description
If LSF_MANAGE_MIG=Y, LSF dynamically creates GPU instances (GI) and compute instances (CI) on each host, and LSF controls the MIG configuration of each host.

If LSF_MANAGE_MIG=N, LSF allocates the existing GIs and CIs based on the configuration of each MIG host and dispatches jobs to the MIG hosts. LSF does not create or destroy GIs or CIs on the MIG hosts.
To change the value of LSF_MANAGE_MIG, first close all MIG hosts with the "badmin hclose" command so that no new MIG jobs are dispatched, wait for the running MIG jobs to finish, and then restart LSF.

Default
N

Notes:
If LSF_MANAGE_MIG=Y is set, do not create or destroy MIG devices manually outside of LSF.
If LSF_MANAGE_MIG=N is set and you want to change the MIG devices on a host, first close the target host with the "badmin hclose" command so that no new MIG jobs are dispatched to it. After the running MIG jobs on the target host finish, destroy the existing MIG devices and then create the new MIG devices.
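
The host-draining part of the notes above might be scripted roughly as follows. This is a sketch under stated assumptions, not a supported tool: drain_mig_host, running_jobs_on, and POLL_SECONDS are hypothetical names, the script assumes that the bjobs -noheader option is available in your LSF version, and it waits for all running jobs on the host rather than for MIG jobs only.

```shell
#!/bin/sh
# Sketch: close one MIG host and wait until no jobs are running on it.
# drain_mig_host and POLL_SECONDS are illustrative names, not LSF features.

POLL_SECONDS="${POLL_SECONDS:-30}"

# Print the number of jobs in RUN state on the given host (all users).
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
running_jobs_on() {
    bjobs -u all -r -m "$1" -noheader 2>/dev/null | grep -c . || true
}

drain_mig_host() {
    host="$1"
    # Stop LSF from dispatching new jobs to the host.
    badmin hclose "$host" || return 1
    # Poll until the running jobs have finished.
    while [ "$(running_jobs_on "$host")" -gt 0 ]; do
        sleep "$POLL_SECONDS"
    done
    echo "$host is closed and idle; MIG devices can now be changed."
}
```

After drain_mig_host returns for a host, the MIG devices on that host can be destroyed and recreated as described in the notes.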

3.     Specify MIG device requirements for a job

Use the new "mig" keyword in the bsub -gpu option to submit a MIG job as follows:
bsub -gpu "num=2:mig=4/4" gpuApp

This job requires 2 MIG devices (num=2), each with a GPU instance of size 4 and a
compute instance of size 4 (mig=4/4).

Alternatively, use the new "mig" keyword in the rusage section of the bsub -R
resource requirement string as follows:

bsub -R "rusage[ngpus_physical=2:mig=4/4]" gpuApp

The format of the mig option is "mig=GI_size/CI_size".
Valid GI_size values are: 1, 2, 3, 4, 7
The value of CI_size must be less than or equal to GI_size.

NVIDIA does not support the following GI_size/CI_size combinations:
4/3 7/5 7/6
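
The rules above can be checked before submission with a small shell function. This is a minimal sketch, not an LSF feature: the name valid_mig_spec is hypothetical, and the function additionally assumes that CI_size is a positive integer without leading zeros, which this readme does not state explicitly.

```shell
#!/bin/sh
# Sketch: check a GI_size/CI_size pair against the rules listed above.
# valid_mig_spec is an illustrative helper, not an LSF command.
valid_mig_spec() {
    spec="$1"
    # The value must contain a slash separating GI_size and CI_size.
    case "$spec" in
        */*) ;;
        *) return 1 ;;
    esac
    gi="${spec%%/*}"
    ci="${spec#*/}"
    # GI_size must be one of the sizes that the readme lists.
    case "$gi" in
        1|2|3|4|7) ;;
        *) return 1 ;;
    esac
    # Assumption: CI_size is a positive integer with no leading zero.
    case "$ci" in
        ''|0*|*[!0-9]*) return 1 ;;
    esac
    # CI_size must be less than or equal to GI_size.
    [ "$ci" -le "$gi" ] || return 1
    # Combinations that the readme says NVIDIA does not support.
    case "$spec" in
        4/3|7/5|7/6) return 1 ;;
    esac
    return 0
}
```

For example, valid_mig_spec 4/4 succeeds, while valid_mig_spec 7/5 and valid_mig_spec 2/3 fail.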

The mig keyword can be specified at the application and queue levels with the
GPU_REQ or RES_REQ parameters.

For example,
GPU_REQ=num=4:mig=4/1:mode=shared
or
RES_REQ=rusage[ngpus_physical=4:mig=2/1] span[gtile=4]

 

4.      GPU fragmentation
MIG imposes restrictions on which GPU slices can be combined to create GPU instances. This can lead to fragmentation as GPU instances are created and destroyed.
If some jobs are pending even though there are enough free GPU slices to create a GI for a job, GPU fragmentation might be the cause. For more information, refer to the NVIDIA documentation.
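
To illustrate how fragmentation can arise, the following sketch models a single 7-slice GPU. It is a simplified illustration, not a description of how LSF places jobs: the can_place helper is hypothetical, the start offsets are assumed from the A100 placement rules that NVIDIA publishes, and memory-slice constraints are ignored.

```shell
#!/bin/sh
# Sketch of MIG fragmentation on a 7-slice GPU (simplified model).
# Assumption: GI start offsets follow NVIDIA's published A100 placements:
#   size 1 -> 0..6, size 2 -> 0,2,4, size 3 -> 0,4, size 4 -> 0, size 7 -> 0.
# Memory-slice constraints are ignored here.

# can_place SIZE USED: succeed if a GI of SIZE compute slices fits.
# USED is a space-separated list of occupied compute slices.
can_place() {
    size="$1"
    used=" $2 "
    case "$size" in
        1) starts="0 1 2 3 4 5 6" ;;
        2) starts="0 2 4" ;;
        3) starts="0 4" ;;
        4|7) starts="0" ;;
        *) return 1 ;;
    esac
    for start in $starts; do
        fits=1
        i="$start"
        # Every slice in [start, start + size) must be free.
        while [ "$i" -lt $((start + size)) ]; do
            case "$used" in
                *" $i "*) fits=0 ;;
            esac
            i=$((i + 1))
        done
        if [ "$fits" -eq 1 ]; then
            return 0
        fi
    done
    return 1
}
```

With size-1 instances on slices 1, 3, and 5, four slices are still free, yet can_place 2 "1 3 5" fails because each allowed start offset (0, 2, 4) overlaps an occupied slice. A job that requests mig=2/1 would stay pending in that situation under this model.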

 

Out of scope:

1.     MIG planner scheduling.

2.     MIG preemption scheduling.

3.     MIG resource reservation.

4.     MIG for the MPS shared case.

5.     MIG gpack/block scheduling.

6.     MIG DCGM rusage collection.

7.     MIG controlled by cgroup.

Readme File for: IBM® Spectrum LSF

Product/Component Release: 10.1

Update Name: Fix 600061

Fix ID: LSF-10.1-build600061

Publication Date: 21 December 2020

Last Modified Date: 21 December 2020

 

Contents

1. List of Fixes

2. Download Location

3. Product or Components Affected

4. System Requirements

5. Installation and Configuration

6. List of Files

7. Product Notifications

8. Copyright and Trademark Information

 

1. List of Fixes

P104014

 

2. Download Location

Download Fix 600061 from the following location: http://www.ibm.com/eserver/support/fixes/

 

3. Product or Components Affected

Affected product or components include:

LSF/lim

LSF/res

LSF/sbatchd

LSF/mbatchd

LSF/mbschd

LSF/bjobs

LSF/bmod

LSF/lshosts

LSF/lsload

LSF/bsub

LSF/bacct

LSF/bhist

LSF/bhosts

LSF/bswitch

LSF/lsbatch.h

LSF/lsf.h

LSF/liblsf.a

LSF/liblsf.so

LSF/libbat.a

LSF/libbat.so

 

4. System Requirements

linux2.6-glibc2.3-x86_64

linux3.10-glibc2.17-x86_64

 

5. Installation and Configuration

5.1 Before installation

(LSF_TOP=Full path to the top-level installation directory of LSF.) 

1) Log on to the LSF master host as the LSF primary administrator 

2) Set your environment: 

- For csh or tcsh: % source LSF_TOP/conf/cshrc.lsf 

- For sh, ksh, or bash: $ . LSF_TOP/conf/profile.lsf 

 

5.2 Installation steps

1) Run badmin hclose all

2) Run badmin qinact all

3) Log on to the LSF master host as root and set the LSF cluster environment 

4) Go to the patch install directory: cd $LSF_ENVDIR/../10.1/install/ 

5) Copy the patch file to the install directory $LSF_ENVDIR/../10.1/install/ 

6) Run patchinstall: ./patchinstall <patch> 

 

5.3 After installation

1) Log on to the LSF master host as the LSF cluster primary administrator and set the LSF cluster environment

2) Run lsadmin limrestart all

3) Run lsadmin resrestart all

4) Run badmin hrestart all

5) Run badmin mbdrestart -s 

6) Run badmin hopen all

7) Run badmin qact all
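
The post-installation steps above can be run in order from a small wrapper. This is a sketch under stated assumptions rather than a supported script: run, post_install, and DRY_RUN are hypothetical names, the wrapper must be run as the LSF cluster primary administrator with the cluster environment already set, and some of the commands might prompt for confirmation.

```shell
#!/bin/sh
# Sketch: run the post-installation steps from section 5.3 in order.
# DRY_RUN and the run/post_install helpers are illustrative names.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@" || { echo "failed: $*" >&2; return 1; }
    fi
}

# Stop at the first step that fails.
post_install() {
    run lsadmin limrestart all &&
    run lsadmin resrestart all &&
    run badmin hrestart all &&
    run badmin mbdrestart -s &&
    run badmin hopen all &&
    run badmin qact all
}
```

Setting DRY_RUN=1 before calling post_install prints the six commands without running them, which is a convenient way to review the sequence first.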

 

5.4 Uninstallation

1) Log on to the LSF master host as the LSF cluster primary administrator and set the LSF cluster environment 

2) Run badmin hclose all 

3) Run badmin qinact all 

4) Log on to the LSF master host as root and set the LSF cluster environment 

5) Go to the patch install directory: cd $LSF_ENVDIR/../10.1/install/ 

6) Run ./patchinstall -r <patch> 

7) Log on to the LSF master host as the LSF cluster primary administrator and set the LSF cluster environment 

8) Run lsadmin limrestart all

9) Run lsadmin resrestart all 

10) Run badmin hrestart all

11) Run badmin mbdrestart -s

12) Run badmin hopen all 

13) Run badmin qact all 

6. List of Files

bsub bmod bjobs bacct bhist bhosts bswitch lshosts lsload lim res sbatchd mbatchd mbschd lsbatch.h lsf.h liblsf.a liblsf.so libbat.a libbat.so

 

7. Product Notifications

To receive information about product solution and patch updates automatically, subscribe to product notifications on the My notifications page (www.ibm.com/support/mynotifications) on the IBM Support website (support.ibm.com). You can edit your subscription settings to choose the types of information that you want to be notified about, for example, security bulletins, fixes, troubleshooting, and product enhancements or documentation changes.

 

 

8. Copyright and Trademark Information

©Copyright IBM Corporation 2020

 

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM®, the IBM logo, and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.