IBM Spectrum LSF 10.1 Fix 535487 Readme File
Abstract
P103321. This fix enables users to connect to an existing job execution host for debugging and for general connectivity.
Description
Readme documentation for IBM Spectrum LSF 10.1 Fix 535487, including installation-related instructions, prerequisites and co-requisites, and a list of fixes.
This fix addresses the following issue:
This fix introduces a new command, "battach", that allows users to connect to a job's execution host for debugging and general connectivity.
For example, for the following job:
root@hostA:~# bsub sleep 100000
root@hostA:~# bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
3283 root RUN normal hostA hostB *ep 100000 Nov 27 02:39
To attach to the execution host of job 3283, run battach with the job ID, and then run hostname to confirm which host the shell is running on:
root@hostA:~# battach 3283
Attaching job execution host: hostB for job: <3283>
# hostname
hostB
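Before attaching, it can help to confirm which host battach will connect to. The following is a minimal sketch, not part of the fix: exec_host_of is a hypothetical shell function that reads the default bjobs output shown above on stdin and prints the EXEC_HOST value of a running job. It assumes the default column layout (JOBID USER STAT QUEUE FROM_HOST EXEC_HOST ...) and that the first execution host appears on the job's own row, possibly with a slot-count prefix such as "2*hostB".

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first execution host of a running job,
# parsed from default `bjobs` output on stdin. Assumes the default column
# order shown in this readme; PEND jobs have no EXEC_HOST value, so only
# rows whose STAT column is RUN are considered.
exec_host_of() {
  local job_id="$1"
  awk -v id="$job_id" 'NR > 1 && $1 == id && $3 == "RUN" {
    host = $6
    sub(/^[0-9]+\*/, "", host)   # drop a slot-count prefix such as "2*"
    print host
    exit
  }'
}
```

For the example job, bjobs | exec_host_of 3283 prints hostB, the same host that battach reports.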
Readme File for: IBM® Spectrum LSF
Product/Component Release: 10.1
Update Name: Fix 535487
Fix ID: LSF-10.1-build535487
Publication Date: 17 December 2019
Last Modified Date: 17 December 2019
Contents
1. List of Fixes
2. Download Location
3. Product or Components Affected
4. System Requirements
5. Limitations
6. Installation and Configuration
7. List of Files
8. Product Notifications
9. Copyright and Trademark Information
1. List of Fixes
P103321
2. Download Location
Download Fix 535487 from the following location: http://www.ibm.com/eserver/support/fixes/
3. Product or Components Affected
Affected product or components include:
LSF/bjobs
LSF/battach
LSF/mbatchd
LSF/mbschd
LSF/sbatchd
LSF/res
4. System Requirements
linux2.6-glibc2.3-x86_64
linux3.10-glibc2.17-x86_64
5. Limitations
a. If cgroup is not enabled in LSF, the battach shell process cannot be controlled by the bstop, bresume, or bkill -s commands
If cgroup is disabled (that is, LSF_PROCESS_TRACKING is disabled in the lsf.conf file), the battach shell process on the job execution host cannot be written into the job cgroup, so it cannot be controlled by the bstop, bresume, or bkill -s commands. Therefore, you must ensure that LSF_PROCESS_TRACKING is enabled for the battach command to work properly.
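The following minimal sketch checks that requirement. process_tracking_enabled is a hypothetical helper, not an LSF command. It assumes lsf.conf uses simple NAME=VALUE lines, skips commented lines, lets the last uncommented setting win, and treats Y or y as enabled. It checks only LSF_PROCESS_TRACKING, not any other cgroup-related parameters your cluster might need.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if LSF_PROCESS_TRACKING is set to Y/y
# in the given lsf.conf, defaulting to $LSF_ENVDIR/lsf.conf.
process_tracking_enabled() {
  local conf="${1:-$LSF_ENVDIR/lsf.conf}"
  local value
  value=$(awk -F= '
    /^[[:space:]]*#/ { next }                 # skip commented-out lines
    {
      key = $1; gsub(/[[:space:]]/, "", key)
      if (key == "LSF_PROCESS_TRACKING") {
        val = $2
        sub(/#.*/, "", val)                   # drop a trailing comment
        gsub(/[[:space:]"]/, "", val)         # drop blanks and quotes
      }
    }
    END { print val }' "$conf")
  [ "$value" = "Y" ] || [ "$value" = "y" ]
}
```

For example, process_tracking_enabled || echo "LSF_PROCESS_TRACKING is not enabled in $LSF_ENVDIR/lsf.conf" prints a warning only when the parameter is missing or not set to Y.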
b. Cannot attach to a parallel job execution host on which no job-related processes are running.
This limitation affects parallel jobs that request multiple hosts but have no job-related processes on some of those execution hosts, for example:
bsub -n 4 -R "span[ptile=2]" sleep 1000
This job requests four slots with two slots per host, so two execution hosts are allocated to it, but the sleep command runs only on the first execution host.
If cgroup is enabled but the job runs only on the first execution host, with no job-related processes on the other execution hosts, you get the following error message when you try to attach to an execution host on which no job-related processes are running:
Attaching job execution host: <hostname> for job: <job_id>
Failed to attach to job: job_id on host: <hostname>
This is because LSF does not start job processes on execution hosts other than the first, so no job process information is saved in the cgroups on those hosts. If LSF allowed users to attach to those hosts, the resource usage of the attached processes (such as memory, CPU, disk, and GPU) would be outside LSF control. The recommended approach is therefore to block users from attaching to execution hosts other than the first for non-Docker parallel jobs that run on a single host.
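The host count in the example above follows from the span[ptile=2] request: ptile sets the number of slots per host, so the job needs the ceiling of slots divided by ptile hosts. The following sketch only illustrates that arithmetic; hosts_for_ptile is a hypothetical helper, not part of LSF.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of hosts needed for <slots> slots when each
# host provides at most <ptile> slots (integer ceiling division).
hosts_for_ptile() {
  local slots="$1" ptile="$2"
  echo $(( (slots + ptile - 1) / ptile ))
}
```

For bsub -n 4 -R "span[ptile=2]", hosts_for_ptile 4 2 prints 2: one host runs the sleep command and the other has no job-related processes to attach to.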
c. Docker version issue
If the LSF execution host runs Docker 18.03 or 18.06 and you run battach on a Docker job, the battach command hangs, the job container stops responding, and the Docker job cannot finish as expected. This is a Docker issue. For more details, refer to the following GitHub issue:
https://github.com/moby/moby/issues/37009
This issue is resolved in Docker 18.09 and later.
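To check whether an execution host is affected, you can compare its Docker server version against 18.09. The following is a sketch under stated assumptions: docker_version_ok is a hypothetical helper, and it compares only the major and minor numbers of version strings such as 18.06.1-ce or 18.09.7.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a Docker version string is 18.09 or later.
# Only MAJOR.MINOR is compared; any patch level or suffix is ignored.
docker_version_ok() {
  local ver="$1" major minor
  major=${ver%%.*}                     # text before the first dot
  minor=${ver#*.}; minor=${minor%%.*}  # text between the first and second dots
  major=$((10#$major)); minor=$((10#$minor))   # force base 10 ("09" -> 9)
  (( major > 18 || (major == 18 && minor >= 9) ))
}
```

On an execution host, docker_version_ok "$(docker version --format '{{.Server.Version}}')" || echo "Docker on $(hostname) is older than 18.09" flags hosts that are still affected.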
d. The Docker container is paused or restarted
If a running Docker container is attached by running "docker exec -it <containerid> <shellname>" and a user then runs "docker pause <containerid>" to pause the container, the attached shell process hangs.
If the user runs "docker restart <containerid>", the attached shell exits. Because the LSF battach command uses "docker exec -it <job_container_id> <shellname>" to attach to a container job, the battach shell process exits if the attached job is restarted with the brequeue or brestart command.
e. If the battach terminal is closed before the user exits the battach shell, the remote shell process remains on the job execution host.
If a running Docker container is attached by running "docker exec -it <containerid> <shellname>" and the user closes the terminal directly without exiting the "docker exec" process, the attached shell does not exit automatically; this is the default Docker behavior. The remaining "docker exec" process exits automatically after the job exits.
6. Installation and Configuration
6.1 Before installation
(LSF_TOP=Full path to the top-level installation directory of LSF.)
1) Log on to the LSF master host as root
2) Set your environment:
- For csh or tcsh: % source LSF_TOP/conf/cshrc.lsf
- For sh, ksh, or bash: $ . LSF_TOP/conf/profile.lsf
6.2 Installation steps
1) Go to the patch install directory: cd $LSF_ENVDIR/../10.1/install/
2) Copy the patch file to the install directory $LSF_ENVDIR/../10.1/install/
3) Run badmin hclose all
4) Run badmin qinact all
5) Run patchinstall: ./patchinstall <patch>
6.3 After installation
1) Log on to the LSF master host as root
2) Run lsadmin resrestart all
3) Run badmin hrestart all
4) Run badmin mbdrestart
5) Run badmin hopen all
6) Run badmin qact all
6.4 Uninstallation
1) Log on to the LSF master host as root
2) Run badmin hclose all
3) Run badmin qinact all
4) Go to the patch install directory: cd $LSF_ENVDIR/../10.1/install/
5) Run ./patchinstall -r <patch>
6) Run lsadmin resrestart all
7) Run badmin hrestart all
8) Run badmin mbdrestart
9) Run badmin hopen all
10) Run badmin qact all
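The procedures in 6.2 to 6.4 can be scripted. The following is a minimal sketch, not an IBM-supplied tool; the function names and the DRY_RUN switch are hypothetical. It assumes you are logged on to the LSF master host as root, have sourced profile.lsf or cshrc.lsf so that LSF_ENVDIR is set, and, for installation, have already copied the patch file into $LSF_ENVDIR/../10.1/install/. The LSF commands may still prompt for confirmation, as they do when run by hand.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers for sections 6.2-6.4. Set DRY_RUN=1 to print the
# commands instead of running them. Note: the functions cd into the patch
# install directory, which changes the caller's working directory.

run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

quiesce_cluster() {        # close all hosts and inactivate all queues
  run badmin hclose all
  run badmin qinact all
}

restart_and_reopen() {     # restart daemons, then reopen hosts and queues
  run lsadmin resrestart all
  run badmin hrestart all
  run badmin mbdrestart
  run badmin hopen all
  run badmin qact all
}

patch_install() {          # 6.2 and 6.3; argument: patch file name
  local patch="$1"
  cd "$LSF_ENVDIR/../10.1/install/" || return 1
  quiesce_cluster
  run ./patchinstall "$patch"
  restart_and_reopen
}

patch_uninstall() {        # 6.4; argument: patch file name
  local patch="$1"
  quiesce_cluster
  cd "$LSF_ENVDIR/../10.1/install/" || return 1
  run ./patchinstall -r "$patch"
  restart_and_reopen
}
```

Running DRY_RUN=1 patch_install followed by the patch file name prints the eight commands in the order listed in 6.2 and 6.3 without executing them, which is a quick way to review the sequence before running it for real.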
7. List of Files
bjobs
battach
mbatchd
mbschd
sbatchd
res
8. Product Notifications
To receive information about product solution and patch updates automatically, subscribe to product notifications on the My notifications page (www.ibm.com/support/mynotifications) on the IBM Support website (support.ibm.com). You can edit your subscription settings to choose the types of information you want to be notified about, for example, security bulletins, fixes, troubleshooting, and product enhancements or documentation changes.
9. Copyright and Trademark Information
©Copyright IBM Corporation 2019
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM®, the IBM logo, and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.