Readme file for IBM® Spectrum LSF Predictor 10.1 Fix Pack 13 (601187)

Abstract

Readme documentation for IBM Spectrum LSF Predictor 10.1 Fix Pack 13 including installation-related instructions, prerequisites, and limitations.

Use LSF Predictor to create and train artificial intelligence (AI) models that predict LSF job memory and run time. It allows administrators to test the AI models, compare the results, and deploy the models to LSF production environments.

Description

Readme documentation for IBM Spectrum LSF Predictor V10.1 Fix Pack 13, including installation instructions, prerequisites and co-requisites, and a list of new features.

Readme file for: IBM® Spectrum LSF Predictor
Product/Component Release: 10.1 Fix Pack 13
Publication date: 24 June 2022
Last modified date: 24 June 2022

 

What’s new

 

·        Support for AutoAI in IBM Cloud Pak for Data 4.0 running in a public cloud environment.

·        Upgrade ibm-watson-machine-learning API from 1.0.79 to 1.0.175.

·        Force password change for first-time logon.

·        Upgrade Elasticsearch from 7.10.2 to 7.16.3, with JDK 17.0.3.1, to avoid Log4j and Java security vulnerabilities.

·        Upgrade Python from 3.7.9 to 3.7.11 to avoid security vulnerabilities.

·        Add a utilization graph of cluster-level memory and slot usage to experiment results.

·        Handle symbolic links (symlinks) when loading LSF configurations.

 

Installation

 

1. System Requirements

RHEL x86_64 version 7.2 or later, or CentOS x86_64 version 7.2 or later

Ubuntu x86_64 version 18.04

Docker CE version 19.03.* or later must be installed and running

2. Installation

1.      Select a Linux host that has enough disk storage. The server must have internet access to download the Docker-in-Docker (dind) image that starts the LSF simulation Docker container. If you want to load the LSF configuration directly, the host must be a client or a server in the LSF production environment. The directory where you extract the package is the working directory; make sure that it has at least 100 GB of free space and that the host has at least 32 GB of memory.

2.      Log on to the selected Linux host as root. Check that the maximum virtual memory map count is at least 262144 (a way to raise it is shown in the sketch after this list):
# sysctl -n vm.max_map_count

3.      Check that the Docker engine is installed and running, and that the following ports are not occupied (one way to check is shown in the sketch after this list):
5050, 20022, 2222, 9200

4.      Add the selected user, for example lsfadmin, to the Docker user group:
# usermod -aG docker lsfadmin

5.      Log in again to the same host as the selected user (lsfadmin) and extract the LSF Cognitive package, lsf_cognitive_predictor_v1.tar.gz:
$ tar -zxvf lsf_cognitive_predictor_v1.tar.gz
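
If the checks in steps 2 and 3 fail, the following is a minimal sketch of one way to fix them as root with standard Linux and Docker commands (none of these commands are shipped with the package).

Raise the virtual memory map count for the running system, then persist it across reboots:
# sysctl -w vm.max_map_count=262144
# echo "vm.max_map_count=262144" >> /etc/sysctl.conf

List any processes that already listen on the required ports (no output means the ports are free):
# ss -tlnp | grep -E ':(5050|20022|2222|9200)\b'

Confirm that the Docker engine is installed and responding:
# docker info > /dev/null && echo "Docker engine OK"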

 

3. Prepare IBM Cloud Pak for Data AutoAI on public cloud

 

AutoAI in the IBM Cloud Pak for Data service is provided by Watson Studio. You must prepare the environment before LSF Predictor can use it to train on LSF job data.

1.      Set up an IBM Cloud Pak for Data AutoAI account on public cloud by completing the steps at https://github.com/IBM/predict-insurance-charges-with-autoai#step-3-create-ibm-cloud-services

2.      Create a Watson Studio project in the Watson Studio web console. Go to the main menu, select View all projects > New project +, and select Create an empty project.

3.      Create a deployment space and record its space ID. Go to the main menu and select View all spaces > New deployment space +. Fill in the fields, then click Create. In the list of spaces, click the new deployment space name to see the space details. Copy and save the space GUID; it is the space ID that you enter on the LSF Predictor settings page.

4.      Generate an IBM Cloud API key (apikey). The apikey is the only credential for logging on to IBM Cloud Pak for Data AutoAI on public cloud (a quick way to confirm that the key is valid is shown after this list). For more information about generating the apikey, see https://github.com/IBM/predict-insurance-charges-with-autoai#71-get-ibm-cloud-api-key.
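
As an optional sanity check, you can confirm that the apikey is valid by exchanging it for an IAM token at the standard IBM Cloud IAM endpoint (this curl call is ordinary IBM Cloud usage, not part of the LSF Predictor package). A JSON response that contains an access_token field means the key works:

$ curl -s -X POST "https://iam.cloud.ibm.com/identity/token" \
    -H "Content-Type: application/x-www-form-urlencoded" \
    -d "grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=<YOUR_APIKEY>"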

 

 

4. Manage the LSF Predictor service

1.      As the selected user (lsfadmin) or root, start the service:

$ cd lsf_cognitive_v1/

$ ./bcogn start -v "<LSF_TOP_OUTSIDE_CONTAINER>:<LSF_TOP_INSIDE_CONTAINER>"

The -v option is optional. It is required only when you load the LSF configuration directly in the web GUI.

The -v option mounts an additional volume or directory from the physical host (LSF_TOP_OUTSIDE_CONTAINER) into the Docker container (LSF_TOP_INSIDE_CONTAINER).

You must use the same volume path for both sides of the pair. The service provides default file volume mounts; you can use them as a bridge to transfer files into the service container.

The following volume mounts are provided by default, where $PACKAGE_TOP is the full directory that is created when you extract the original package (for example, /opt/test/lsf_cognitive_v1):

Volume path outside the container    Volume path inside the container
$PACKAGE_TOP                         $PACKAGE_TOP
$PACKAGE_TOP/work                    /opt/ibm/prediction/work
$PACKAGE_TOP/logs                    /opt/ibm/prediction/logs
$PACKAGE_TOP/config                  /opt/ibm/prediction/config
Note: If the LSF production cluster is installed with LSF Suite, the installation might create a symbolic link from /opt/ibm/lsfsuite/lsf to /sharedir/lsf. To import LSF configurations from this production cluster directly, you must map both directories when you start the service. For example:

$ ./bcogn start -v "/opt/ibm/lsfsuite:/opt/ibm/lsfsuite;/sharedir/lsf:/sharedir/lsf"

 

2.      As the selected user (lsfadmin) or root, stop the service:
$ ./bcogn stop
To show the current service status:
$ ./bcogn status
SERVICE            STATUS    ID             PORT
lsf_cognitive:v1   STARTED   e370e9723742   5050, 9200, 20022
docker:dind        STARTED   c420d5ad63e2   2375, 2376
webssh             STARTED   22651          2222

3.      Upgrade the current service to a new subversion, or to a new package with the same version:

a.      Go to the parent directory of the current installation, then stop the service:
$ cd xxx
$ ./lsf_cognitive_v1/bcogn stop

b.      Extract the new package to overwrite the old package:
$ tar -zxvf lsf_cognitive_v1.tar.gz

c.      Upgrade and start the services with the -u option:
$ ./lsf_cognitive_v1/bcogn start -u
Or upgrade without starting the services:
$ ./lsf_cognitive_v1/bcogn upgrade

4.      Uninstall the package:
The following command stops the services and removes the lsf_cognitive:v1 Docker image from the image repository (a way to confirm the removal is shown after this list):
$ ./lsf_cognitive_v1/bcogn stop -f
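
To confirm that the image was removed, you can list the remaining images with the standard Docker CLI (this is ordinary docker usage, not a bcogn subcommand); no output means the image is gone:

$ docker images | grep lsf_cognitive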

 

5. Log on to the LSF Cognitive service

1.      Prepare your browser by importing the certificate as a trusted root certificate: config/https/cacert_lsf.pem

2.      From your web browser, log on to the service by using the following URL: https://<SERVICE_HOST>:5050/
Username:
Admin
Password:
Admin
You are required to change the password at first logon.
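
If you want to verify the certificate and the service endpoint from the command line before opening the browser, a minimal check with standard curl (not a command shipped with the package) is:

$ curl --cacert config/https/cacert_lsf.pem https://<SERVICE_HOST>:5050/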

 

6. Use the LSF Cognitive service

 

Basic concepts:

Experiment

An experiment is an LSF simulation run that uses a selected LSF configuration and workload snapshot.

LSF configuration

An LSF configuration is a full set of LSF cluster configuration and workload policies.

Workload Snapshot

A workload snapshot is a set of job submission and completion records that are imported from the LSF cluster events files (lsb.events*).

Prediction

A prediction is the process that generates a job resource prediction model. It includes collecting LSF job event data, cleaning the data, sending the data to IBM Cloud Pak for Data AutoAI for training, downloading the model, and so on.

Inference

An inference is a service that predicts the best value for a newly submitted LSF job request, based on the pre-trained model.

 

7. Data and logs

All the data that is related to experiments, LSF configurations, workload snapshots, and prediction models is saved in the lsf_cognitive_v1/work directory, which persists across service restarts. You can back up the data to another directory at any time (a minimal backup sketch is shown below).
Service daemon logs are saved in the directory:
lsf_cognitive_v1/logs
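
A minimal way to back up the data directory is standard tar from the parent directory of the installation (the timestamped file name is only an illustration; ideally stop the service first so that the snapshot is consistent):

$ tar -czf work_backup_$(date +%Y%m%d).tar.gz lsf_cognitive_v1/work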

 

Limitations

·        Dynamic hosts are not recognized. If the LSF cluster to be simulated contains dynamic hosts, the simulator data collecting tool tries to convert those dynamic hosts to static hosts in the collected data. The conversion adds those hosts to the hostlist line of the lsf.cluster.<cluster_name> file in the conf directory of the collected data. The hosts then show up as static hosts in the simulated cluster and are recognized by the simulator.

·        If LSF Data Manager is enabled in the LSF production environment, experiments count 1024 CPUs for the data manager primary host, which leads to improper results. Disable LSF Data Manager before you load the LSF configuration into the LSF Cognitive system.

·        Dynamic resources that are reported from an elim are converted to fixed resources when you load the LSF configuration.

·        LSF Predictor 10.1 Fix Pack 13 only works with a single LSF cluster and does not support LSF multi-cluster mode.

·        User groups that are defined at the queue level for fairshare are not imported when you import a workload snapshot. User groups that are specified with bsub -G are imported properly.

 

Copyright and Trademark Information

©Copyright IBM Corporation 2022

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM®, the IBM logo, and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.