IBM Spectrum LSF 10.1 Fix 535487 Readme File

Abstract

P103321. This fix enables users to connect to an existing job execution host for debugging and for general connectivity.

Description

Readme documentation for IBM Spectrum LSF 10.1 Fix 535487 including installation-related instructions, prerequisites and co-requisites, and list of fixes.

This fix addresses the following issue:

This fix introduces a new command named "battach", which allows users to connect to the job execution host for debugging and general connectivity.

For example, for the following job:

bsub sleep 100000

root@hostA:~# bjobs

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME

3283    root    RUN   normal     hostA       hostB       *ep 100000 Nov 27 02:39

For the job with ID 3283, attach to the job execution host by running the battach command, then run hostname in the attached shell to confirm the host:

root@hostA:~# battach 3283

Attaching job execution host: hostB for job: <3283>

# hostname

hostB
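
Before attaching, it can be useful to confirm that the job is running and to see which host battach will connect to. The following is a minimal sketch, not part of the fix: exec_host_for_job is a hypothetical helper that reads default-format bjobs output on standard input and assumes the column order shown in the example above. It only handles single-line entries, so it is not suitable for parallel jobs whose host list wraps across lines.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): print the EXEC_HOST column for
# one job ID, reading default-format bjobs output on standard input.
# Assumes the column order JOBID USER STAT QUEUE FROM_HOST EXEC_HOST ...
# Only RUN jobs are matched, because pending jobs have no EXEC_HOST value.
exec_host_for_job() {
    awk -v id="$1" '$1 == id && $3 == "RUN" { print $6 }'
}

# Possible use on a cluster (not executed here):
#   bjobs 3283 | exec_host_for_job 3283
#   battach 3283
```

An empty result means that the job was not found in the output or is not in the RUN state.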

Readme File for: IBM® Spectrum LSF

Product/Component Release: 10.1

Update Name: Fix 535487

Fix ID: LSF-10.1-build535487

Publication Date: 17 December 2019

Last Modified Date: 17 December 2019


Contents

1. List of Fixes

2. Download Location

3. Product or Components Affected

4. System Requirements

5. Limitations

6. Installation and Configuration

7. List of Files

8. Product Notifications

9. Copyright and Trademark Information


1. List of Fixes

P103321


2. Download Location

Download Fix 535487 from the following location: http://www.ibm.com/eserver/support/fixes/


3. Product or Components Affected

Affected product or components include:

LSF/bjobs

LSF/battach

LSF/mbatchd

LSF/mbschd

LSF/sbatchd

LSF/res


4. System Requirements

linux2.6-glibc2.3-x86_64

linux3.10-glibc2.17-x86_64


5. Limitations

a. If cgroup is not enabled in LSF, the battach shell process cannot be controlled by the bstop/bresume/bkill -s commands

If cgroup is disabled (that is, LSF_PROCESS_TRACKING is disabled in the lsf.conf file), the battach shell process on job execution hosts cannot be written into the job cgroup, so the battach shell process cannot be controlled by the bstop/bresume/bkill -s commands. Therefore, you must ensure that LSF_PROCESS_TRACKING is enabled for the battach command to work properly.
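
The setting can be checked before relying on battach. The following is a minimal sketch under stated assumptions: process_tracking_enabled is a hypothetical helper, it treats a value of Y as enabled, and it uses the last uncommented assignment if the parameter appears more than once (how LSF itself resolves duplicates is not covered by this readme).

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): return 0 if the given lsf.conf
# sets LSF_PROCESS_TRACKING to Y, 1 if it does not, 2 if the file is unreadable.
# Commented-out lines are ignored because the pattern is anchored at line start.
process_tracking_enabled() {
    conf_file="$1"
    [ -r "$conf_file" ] || return 2
    # Assumption: if the parameter is repeated, the last assignment wins.
    value=$(sed -n 's/^[[:space:]]*LSF_PROCESS_TRACKING[[:space:]]*=[[:space:]]*//p' "$conf_file" |
        tail -n 1 | tr -d '"[:space:]')
    case "$value" in
        [Yy]) return 0 ;;
        *)    return 1 ;;
    esac
}

# Possible use after the LSF environment is set (not executed here):
#   process_tracking_enabled "$LSF_ENVDIR/lsf.conf" || echo "LSF_PROCESS_TRACKING is not enabled"
```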

b. Cannot attach to a parallel job execution host on which no job-related processes are running.

This limitation applies to parallel jobs that request multiple hosts but have no job-related processes on some of the job execution hosts, for example:

bsub -n 4 -R "span[ptile=2]" sleep 1000

This job runs on only one execution host, even though additional execution hosts are assigned to the job.

If cgroup is enabled, but the job only runs on the first execution host and there are no job-related processes on the other job execution hosts, you get the following error message if you try to attach to a job execution host on which no job-related processes are running:

Attaching job execution host: <hostname> for job: <job_id>

Failed to attach to job: job_id on host: <hostname>

This is because LSF does not start job processes on job execution hosts other than the first, so no job process information is saved in cgroups on those hosts. If LSF allowed users to attach to execution hosts other than the first, the resource usage of the attached processes (such as memory, CPU, HDD, and GPU) would be outside LSF control. The preferred approach is therefore to block users from attaching to job execution hosts other than the first for non-Docker parallel jobs that run on a single host.
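
Because only the first execution host is guaranteed to have job processes in this case, a script might want to identify that host before attempting to attach. The following is a minimal sketch: first_exec_host is a hypothetical helper, and the input format (a colon-separated host list with an optional "N*" slot-count prefix on each entry) is an assumption that should be checked against the output on your cluster.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): print the first host from a
# colon-separated execution host list such as "2*hostB:2*hostC" or "hostB:hostC".
# The list format is an assumption; verify it against your own bjobs output.
first_exec_host() {
    first=${1%%:*}          # keep the text before the first colon
    host=${first#*[*]}      # drop an optional "N*" slot-count prefix
    printf '%s\n' "$host"
}
```

For example, first_exec_host "2*hostB:2*hostC" prints hostB.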

c. Docker version issue

If the LSF execution host is running Docker version 18.03 or 18.06, and you submit a Docker job and then run battach on that job, the battach command hangs, the job container stops responding, and the Docker job cannot finish as expected. This is an issue with Docker. For more details, refer to the following GitHub issue:

https://github.com/moby/moby/issues/37009

This issue is resolved in Docker version 18.09 or later.
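
The Docker version on an execution host can be checked before battach is used with Docker jobs. The following is a minimal sketch: docker_version_ok is a hypothetical helper that compares only the major and minor fields of a version string against 18.09 and ignores any patch level or suffix.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if a Docker version string such as "18.09.1"
# or "19.03.5-ce" is 18.09 or later, and 1 otherwise.
# awk converts "09" to the number 9, so leading zeros are handled.
docker_version_ok() {
    printf '%s\n' "$1" |
        awk -F. '{ exit !(($1 + 0 > 18) || ($1 + 0 == 18 && $2 + 0 >= 9)) }'
}

# Possible use on an execution host (not executed here):
#   docker_version_ok "$(docker version --format '{{.Server.Version}}')" ||
#       echo "Docker is older than 18.09; battach might hang on Docker jobs"
```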

d. Docker container is paused or restarted

If a running Docker container is attached by running "docker exec -it <containerid> <shellname>", and a user runs "docker pause <containerid>" to pause the container, the attached shell process hangs.

If the user runs "docker restart <containerid>", the attached shell exits. The LSF battach command uses "docker exec -it <job_container_id> <shellname>" to attach to a container job, so if the attached job is restarted with the brequeue or brestart command, the battach shell process exits.

e. If the battach shell terminal is closed before the user exits the battach shell, the remote shell process remains on the job execution host.

If a running Docker container is attached by using "docker exec -it <containerid> <shellname>", and if the user closes the attached terminal directly without exiting the attached "docker exec" process, the attached shell does not exit automatically, which is the default Docker behavior. The "docker exec" process exits automatically after the job exits.


6. Installation and Configuration

6.1 Before installation

(LSF_TOP=Full path to the top-level installation directory of LSF.)

1) Log on to the LSF master host as root

2) Set your environment:

- For csh or tcsh: % source LSF_TOP/conf/cshrc.lsf

- For sh, ksh, or bash: $ . LSF_TOP/conf/profile.lsf
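
A quick check can confirm that the environment was set before continuing. The following is a minimal sketch: lsf_env_ready is a hypothetical helper that only tests whether LSF_ENVDIR is set and points to a directory that contains lsf.conf.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): return 0 if LSF_ENVDIR is set
# and the directory it names contains an lsf.conf file.
lsf_env_ready() {
    [ -n "${LSF_ENVDIR:-}" ] && [ -f "$LSF_ENVDIR/lsf.conf" ]
}

# Possible use after sourcing cshrc.lsf or profile.lsf (not executed here):
#   lsf_env_ready || echo "LSF environment is not set"
```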

6.2 Installation steps

1) Go to the patch install directory: cd $LSF_ENVDIR/../10.1/install/

2) Copy the patch file to the install directory $LSF_ENVDIR/../10.1/install/

3) Run badmin hclose all

4) Run badmin qinact all

5) Run patchinstall: ./patchinstall <patch>

6.3 After installation

1) Log on to the LSF master host as root

2) Run lsadmin resrestart all

3) Run badmin hrestart all

4) Run badmin mbdrestart

5) Run badmin hopen all

6) Run badmin qact all
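
The steps in sections 6.2 and 6.3 can be collected into a script. The following is a minimal sketch, not a supported installer: run and apply_fix are hypothetical helpers, DRY_RUN is a hypothetical variable that makes the script print each command instead of running it, and the sketch assumes that the LSF environment is set and that the patch file was already copied into the install directory (step 2 of section 6.2).

```shell
#!/bin/sh
# Hypothetical wrapper: run one command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: steps from sections 6.2 and 6.3 in the documented order.
# $1 is the patch file name. The body runs in a subshell so that the
# directory change does not affect the caller. The chain stops at the
# first command that fails.
apply_fix() (
    patch_file="$1"
    run cd "$LSF_ENVDIR/../10.1/install/" &&
    run badmin hclose all &&
    run badmin qinact all &&
    run ./patchinstall "$patch_file" &&
    run lsadmin resrestart all &&
    run badmin hrestart all &&
    run badmin mbdrestart &&
    run badmin hopen all &&
    run badmin qact all
)
```

Running the function with DRY_RUN=1 first shows the command sequence without changing the cluster. Some of these commands might prompt for confirmation when they are run for real.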

6.4 Uninstallation

1) Log on to the LSF master host as root

2) Run badmin hclose all

3) Run badmin qinact all

4) Go to the patch install directory: cd $LSF_ENVDIR/../10.1/install/

5) Run ./patchinstall -r <patch>

6) Run lsadmin resrestart all

7) Run badmin hrestart all

8) Run badmin mbdrestart

9) Run badmin hopen all

10) Run badmin qact all


7. List of Files

bjobs

battach

mbatchd

mbschd

sbatchd

res


8. Product Notifications

To receive information about product solution and patch updates automatically, subscribe to product notifications on the My notifications page (www.ibm.com/support/mynotifications) on the IBM Support website (support.ibm.com). You can edit your subscription settings to choose the types of information you want to be notified about, for example, security bulletins, fixes, troubleshooting, and product enhancements or documentation changes.



9. Copyright and Trademark Information

© Copyright IBM Corporation 2019


U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM®, the IBM logo, and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.