Readme file for IBM Watson® Machine Learning Accelerator Interim Fix 536919  

 

Readme file for: IBM Watson® Machine Learning Accelerator
Product/Component Release: 1.2.1
Fix ID: dli-1.2.3-build536919-wmla

Publication date: January 24, 2020

 

This interim fix introduces enhancements and resolves the following issues in the IBM Deep Learning Impact 1.2.3 component in IBM Watson Machine Learning Accelerator 1.2.1:

·        Support for IBM Spectrum Conductor 2.4.1 with WML Accelerator 1.2.1

·        RFE 137803: Support for hyperparameter search plugins

·        RFE 137804: Addition of a user-defined hyperparameter experiment

·        Fixed issues where hyperparameter optimization job tasks remained in the RUNNING state. Previously, when the hyperparameter optimization token expired after 8 hours, tasks in the RUNNING state were stopped and no longer updated. With this fix, the number of hours until expiry can be changed, and job tasks no longer remain in the RUNNING state when a job is stopped.

·        Fixed performance issues with deep learning insight metrics in the cluster management console.

·        Fixed issues with obtaining the best search experiment result when the loss value is negative.

·        Fixed job failure issues related to the ps.conf file in the temporary working directory.

·        Fixed issues with creating a hyperparameter optimization task so that it uses the set values provided by the experiment when running a training job.

·        Enhanced the elastic distributed training API to include elastic distributed training worker logger callback for test metrics.

 

 

  

Contents

1.     Download location 

2.     Products or components affected

3.     Installation and configuration

4.     Uninstallation

5.     Installing and using hyperparameter search plugins (RFE 137803)

6.     User-defined hyperparameter experiments (RFE 137804)

7.   EDT worker logger callback to handle test metric

8.     List of files

9.     Product notifications 

10.  Copyright and trademark information

 

1.   Download location

Download interim fix 536919 (dli-1.2.3-build536919-wmla) from the following location: http://www.ibm.com/eserver/support/fixes/.

 

2.   Products or components affected

Component names: DLPD, GUI, EDT, dlinsights

Platforms: Linux x86_64, Linux ppc64le

Fix ID: dli-1.2.3-build536919-wmla

 

3.   Installation and configuration

3.1 Before installation

Before installing the interim fix, complete the following steps to prepare your environment, complete the rolling upgrade to IBM Spectrum Conductor 2.4.1, and ensure that any required interim fixes are installed. 

 

1.      Log on to the master host as a cluster administrator (CLUSTERADMIN) and source the environment according to your shell environment. 

For sh, ksh or bash:

> . $EGO_TOP/profile.platform

> export DLI_SHARED_FS=$DLI_SHARED_FS

> export CLUSTERADMIN=$CLUSTERADMIN

> export DLI_CONDA_HOME=$DLI_CONDA_HOME

> export WLP=$EGO_TOP/wlp/usr/servers/gui/apps/dli/1.2.3/

where EGO_TOP is the IBM Spectrum Conductor Deep Learning Impact installation path, and DLI_SHARED_FS, CLUSTERADMIN, and DLI_CONDA_HOME must match the IBM Spectrum Conductor Deep Learning Impact installation settings.

 

For csh or tcsh:

> source $EGO_TOP/cshrc.platform

> setenv DLI_SHARED_FS $DLI_SHARED_FS

> setenv CLUSTERADMIN $CLUSTERADMIN

> setenv DLI_CONDA_HOME $DLI_CONDA_HOME

> setenv WLP $EGO_TOP/wlp/usr/servers/gui/apps/dli/1.2.3/

where EGO_TOP is the IBM Spectrum Conductor Deep Learning Impact installation path, and DLI_SHARED_FS, CLUSTERADMIN, and DLI_CONDA_HOME must match the IBM Spectrum Conductor Deep Learning Impact installation settings.

 

2.      Create a backup directory, for example /build536919_backup, and back up the following files:

> cp $DLI_SHARED_FS/tools/tune/bayes_opt_crl.py /build536919_backup

> cp $DLI_SHARED_FS/tools/tune/util/bayesian_opt_utils.py /build536919_backup

> cp $DLI_SHARED_FS/tools/tune/util/sobol_lib.py /build536919_backup

> cp $DLI_SHARED_FS/conf/model_files_control_util.sh /build536919_backup

> cp $DLI_SHARED_FS/conf/spark-env.sh /build536919_backup

> cp $DLI_SHARED_FS/tools/dl_plugins/common_wrapper.sh /build536919_backup

> cp $DLI_SHARED_FS/tools/dl_plugins/dlioptgen.py /build536919_backup

> cp $DLI_SHARED_FS/tools/dl_plugins/edtPyTorchCmdGen.py /build536919_backup

> cp $DLI_SHARED_FS/fabric/1.2.3/libs/fabric.zip /build536919_backup

> cp $EGO_TOP/dli/1.2.3/dlpd/lib/*5.4.2*.jar /build536919_backup

> cp $EGO_TOP/dli/1.2.3/dlpd/lib/lucene-*.jar /build536919_backup

> cp $EGO_TOP/integration/elk/conf/indexer/dlinsights_logstash_driver.conf /build536919_backup

> cp $EGO_TOP/integration/elk/conf/indexer/dlinsights_logstash_process.conf /build536919_backup

 

3.     Run the DLI_export_Insights.sh script to export data from IBM Spectrum Conductor Deep Learning Impact. You will import this data later after you complete the installation of interim fix 536919.

      a. Download the export and import scripts from IBM Cloud and copy them to $EGO_TOP/integration/elk/1.4.3/scripts.

b. Run the DLI_export_Insights.sh script:

./DLI_export_Insights.sh Admin Admin export_directory

 

4.      Upgrade to IBM Spectrum Conductor 2.4.1 in your IBM Watson Machine Learning Accelerator 1.2.1 installation. Before you upgrade to Spectrum Conductor 2.4.1, ensure that you have installed the required fixes. Depending on which fixes you have previously applied to your IBM Watson Machine Learning Accelerator 1.2.1 installation, you must ensure that you are following one of the supported upgrade paths: 

·        If you have previously applied interim fix 527174 and upgraded IBM Spectrum Conductor version 2.3.0 to 2.4.0, you can now perform an upgrade from IBM Spectrum Conductor version 2.4.0 to 2.4.1. To complete the upgrade, follow the upgrade documentation for IBM Spectrum Conductor 2.4.1 in the IBM Knowledge Center, using the conductor2.4.1-upgrade-wmla-x86_64.bin package for x86_64 and the conductor2.4.1-upgrade-wmla-ppc64le.bin package for ppc64le.

·        If you have not previously applied interim fix 527174, you can now perform an upgrade from IBM Spectrum Conductor version 2.3.0 to 2.4.1. To complete the upgrade, follow the upgrade documentation for IBM Spectrum Conductor 2.4.1 in the IBM Knowledge Center, using the conductor2.4.1-upgrade-wmla-x86_64.bin package for x86_64 and the conductor2.4.1-upgrade-wmla-ppc64le.bin package for ppc64le. IMPORTANT: After you successfully complete the upgrade, you must apply interim fix 527174 before proceeding with the installation of interim fix 536919. You do not need to upgrade Spark packages after installing interim fix 527174.

 

3.2 Installation steps

After successfully upgrading to IBM Spectrum Conductor 2.4.1 and installing interim fix 527174, apply interim fix 536919 by completing the following steps.

 

1.      Log on to the master host as the cluster administrator (CLUSTERADMIN) and source the environment.

2.      Delete the old Elasticsearch template:

> export ELK_ESHTTP_PORT=$(grep ^ELK_ESHTTP_PORT $EGO_CONFDIR/../../integration/elk/conf/elk.conf|awk -F'=' '{print $2}')

> curl -k -u Admin:Admin -X DELETE https://$(hostname -f):$ELK_ESHTTP_PORT/_template/*

 

3.      Stop the following services:

> egosh service stop dlpd dlinsights-monitor dlinsights-optimizer

> egosh service stop plc WEBGUI REST ascd

> egosh service stop elk-shipper elk-indexer elk-elasticsearch-data elk-elasticsearch elk-elasticsearch-master elk-manager

 

4.      As the root user, on each management host, reinstall the dlinsights conda environment: 

> source ${DLI_CONDA_HOME}/etc/profile.d/conda.sh

> conda remove --name dlinsights --all --yes

> conda create --name dlinsights --yes pip python=3.6

> conda activate dlinsights

> conda install --yes numpy==1.12.1 pyopenssl==18.0.0 flask==1.0.2 Flask-Cors==3.0.3 scipy==1.0.1 SQLAlchemy==1.1.13 requests==2.21 alembic==1.0.5 pathlib==1.0.1

> pip install --no-cache-dir elasticsearch==7.1.0 Flask-Script==2.0.5 Flask-HTTPAuth==3.2.2 mongoengine==0.11.0

> conda deactivate

 

5.      As the cluster administrator (CLUSTERADMIN), on each management host, download the packages to a directory. For example, packages can be downloaded to the /dlifixes directory.

6.      Run the egoinstallfixes command to install the cluster JAR files.

For Linux x86_64:

> egoinstallfixes /dlifixes/dlicore-1.2.3.0_x86_64_build536919.tar.gz

> egoinstallfixes /dlifixes/dlimgmt-1.2.3.0_x86_64_build536919.tar.gz

 

For Linux ppc64le:

> egoinstallfixes /dlifixes/dlicore-1.2.3.0_ppc64le_build536919.tar.gz

> egoinstallfixes /dlifixes/dlimgmt-1.2.3.0_ppc64le_build536919.tar.gz

 

NOTE: Running the egoinstallfixes command automatically backs up the current binary files to a fix backup directory for recovery purposes. Do not delete this backup directory; you will need it if you want to recover the original files. For more information on using this command, see the egoinstallfixes command reference.

 

7.      Run the pversions command to verify the installation:

> pversions -b 536919

 

8.      (Optional) If you have previously applied interim fix 527174 and upgraded IBM Spectrum Conductor version 2.3.0 to 2.4.0, perform the following steps to update the cluster management console:

> webapp_dlgui='<include optional="true" location="${env.EGO_CONFDIR}/../../gui/conf/webapp/webapp_dlgui.xml"/>'

> webapp_dlguiv5='<include optional="true" location="${env.EGO_CONFDIR}/../../gui/conf/webapp/webapp_dlguiv5.xml"/>'

> sed -i '/dlgui/d' $EGO_CONFDIR/../../gui/conf/webapp/server_internal.xml

> sed -i "13a $webapp_dlgui" $EGO_CONFDIR/../../gui/conf/webapp/server_internal.xml

> sed -i "14a $webapp_dlguiv5" $EGO_CONFDIR/../../gui/conf/webapp/server_internal.xml

> sed -i "s#@EDI_HELP@#<Help role=\"\" display=\"Elastic Distributed Inference RESTful API\" url=\"/dlgui/apidocs/edi/index.html\" width=\"1280\" height=\"720\" />#" $EGO_CONFDIR/../../gui/conf/help/pmc_DLI_help.xml

> sed -i 's#urlPrefix="true"/>#/>#g' $EGO_CONFDIR/../../gui/conf/help/pmc_DLI_help.xml

> sed -i 's#gotoAbout.do#gotoAbout.controller#g' $EGO_CONFDIR/../../gui/conf/help/pmc_DLI_help.xml 

 

9.      If you have EGO_SHARED_TOP configured, copy the hpo_algo.conf file to EGO_SHARED_TOP:

> cp ${EGO_TOP}/dli/conf/dlpd/hpo_algo.conf ${EGO_SHARED_TOP}/dli/conf/dlpd

  

10.   Log on to the master host as a cluster administrator (CLUSTERADMIN), download the dli-1.2.3.0_build536919_share.tar.gz package and extract its contents to the top-level $DLI_SHARED_FS directory:

> tar zoxf dli-1.2.3.0_build536919_share.tar.gz -C $DLI_SHARED_FS

 

11.   Ensure that the dli-1.2.3.0_build536919_share.tar.gz patch files are set with correct permissions:

> chmod 755 -R $DLI_SHARED_FS/tools/tune/plugins

> chmod 755 $DLI_SHARED_FS/conf/model_files_control_util.sh

> chmod 775 $DLI_SHARED_FS/conf/spark-env.sh

> chmod 755 $DLI_SHARED_FS/tools/tune/bayes_opt_crl.py

> chmod 755 $DLI_SHARED_FS/tools/tune/util/bayesian_opt_utils.py

> chmod 755 $DLI_SHARED_FS/tools/tune/util/sobol_lib.py

> chmod 755 $DLI_SHARED_FS/tools/dl_plugins/common_wrapper.sh

> chmod 755 $DLI_SHARED_FS/tools/dl_plugins/dlioptgen.py

> chmod 755 $DLI_SHARED_FS/tools/dl_plugins/shipper.yml

> chmod 755 $DLI_SHARED_FS/tools/dl_plugins/shipper.sh

> chmod 755 $DLI_SHARED_FS/tools/dl_plugins/edtPyTorchCmdGen.py

> chmod 775 $DLI_SHARED_FS/fabric/1.2.3/libs/fabric.zip

 

12.   Delete the old Elasticsearch files:

> rm $EGO_TOP/integration/elk/conf/indexer/dlinsights_logstash_driver.conf 

> rm $EGO_TOP/integration/elk/conf/indexer/dlinsights_logstash_process.conf 

 

13.   Start the following services:

> egosh service start elk-manager elk-elasticsearch-master elk-elasticsearch elk-elasticsearch-data elk-indexer elk-shipper

> egosh service start dlpd dlinsights-monitor dlinsights-optimizer

> egosh service start plc WEBGUI REST ascd

 

14.   Install the dependent Python module in the default Python 3 conda environment:

> source ${DLI_CONDA_HOME}/etc/profile.d/conda.sh

> conda activate dlipy3

> pip install dill==0.3.1.1

> conda deactivate

 

15.   Import the data that you exported before the IBM Spectrum Conductor upgrade to version 2.4.1.

a.      Change to the $EGO_TOP/integration/elk/1.4.3/scripts directory in IBM Spectrum Conductor 2.4.1:

> cd $EGO_TOP/integration/elk/1.4.3/scripts

b.      Run the import script:

> ./DLI_import_Insights.sh Admin Admin export_directory

 

16.   Log out of the cluster management console, clear your browser cache, and log in again.

 

4.   Uninstallation

If required, follow the instructions in this section to uninstall this interim fix on hosts in your cluster.

 

1.      Log in to the management host as a cluster administrator (CLUSTERADMIN) and source the environment.

2.      Stop the following services:

> egosh service stop elk-shipper elk-indexer elk-elasticsearch-data elk-elasticsearch elk-elasticsearch-master elk-manager

> egosh service stop dlpd dlinsights-monitor dlinsights-optimizer

> egosh service stop plc WEBGUI REST ascd

 

3.      Log on to each management host in the cluster and roll back this interim fix:

> egoinstallfixes -r 536919

 

4.      Restore the files that you backed up before installing the interim fix:

> cp /build536919_backup/bayes_opt_crl.py $DLI_SHARED_FS/tools/tune

> cp /build536919_backup/bayesian_opt_utils.py $DLI_SHARED_FS/tools/tune/util

> cp /build536919_backup/sobol_lib.py $DLI_SHARED_FS/tools/tune/util

> cp /build536919_backup/model_files_control_util.sh $DLI_SHARED_FS/conf

> cp /build536919_backup/spark-env.sh $DLI_SHARED_FS/conf

> cp /build536919_backup/common_wrapper.sh $DLI_SHARED_FS/tools/dl_plugins

> cp /build536919_backup/dlioptgen.py $DLI_SHARED_FS/tools/dl_plugins

> cp /build536919_backup/edtPyTorchCmdGen.py $DLI_SHARED_FS/tools/dl_plugins

> cp /build536919_backup/fabric.zip $DLI_SHARED_FS/fabric/1.2.3/libs

> rm -v $EGO_TOP/dli/1.2.3/dlpd/lib/*7.2.1*.jar

> rm -v $EGO_TOP/dli/1.2.3/dlpd/lib/lucene-*.jar

> rm -v $EGO_TOP/dli/1.2.3/dlpd/lib/httpasyncclient-4.1.2.jar

> rm -v $EGO_TOP/dli/1.2.3/dlpd/lib/httpcore-nio-4.4.5.jar

> cp /build536919_backup/*5.4.2*.jar $EGO_TOP/dli/1.2.3/dlpd/lib

> cp /build536919_backup/lucene-*.jar $EGO_TOP/dli/1.2.3/dlpd/lib

> cp /build536919_backup/dlinsights_logstash_driver.conf $EGO_TOP/integration/elk/conf/indexer

> cp /build536919_backup/dlinsights_logstash_process.conf $EGO_TOP/integration/elk/conf/indexer

 

5.      On each management host, reinstall the dlinsights conda environment: 

> source ${DLI_CONDA_HOME}/etc/profile.d/conda.sh

> conda remove --name dlinsights --all --yes

> conda create --name dlinsights --yes pip python=3.6

> conda activate dlinsights

> conda install --yes numpy==1.12.1 pyopenssl==18.0.0 flask==1.0.2 Flask-Cors==3.0.3 scipy==1.0.1 SQLAlchemy==1.1.13 requests==2.21 alembic==1.0.5 pathlib==1.0.1

> pip install --no-cache-dir elasticsearch==5.2.0 Flask-Script==2.0.5 Flask-HTTPAuth==3.2.2 mongoengine==0.11.0

> conda deactivate

 

6.      Start the following services:

> egosh service start elk-manager elk-elasticsearch-master elk-elasticsearch elk-elasticsearch-data elk-indexer elk-shipper

> egosh service start dlpd dlinsights-monitor dlinsights-optimizer

> egosh service start plc WEBGUI REST ascd

 

 

5.   Installing and using hyperparameter search plugins (RFE 137803)

 

Interim fix 536919 introduces a new method for adding hyperparameter search algorithms to WML Accelerator as plugins.

 

5.1 Hyperparameter search plugin installation and management

 

5.1.1 Installing hyperparameter search plugins

There are two methods to install hyperparameter search plugins: 

1.      Local install: Use this method if the plugin scripts are located on the dlpd server, or in a shared folder that the dlpd service can access.

2.      Upload install: Use this method if you want to upload the plugin scripts from a client host.

 

Note: If the script files are large, copy the files to the server first and use the local install method. 

 

5.1.1.1 Local Install 

Use the following REST command to submit a local installation request (a worked example appears at the end of this section):

> curl -k -u ${clusteruser}:${password} -X POST \

     -H 'Accept: application/json' \

     -H 'Content-Type: multipart/form-data' \

     -F data='{"name": "${plugin_algorithm_name}", "path": "${plugin_path_in_server}", "condaHome": "${conda_home_path}", "condaEnv": "${conda_env_name}", "remoteExec": ${enable_remote_execution}, "logLevel": "${log_level}"}' \

     'https://${host_name}:${dlpd_rest_port}/platform/rest/deeplearning/v1/hypersearch/algorithm/install'

 

Parameters:

plugin_algorithm_name (string, required): Name of the user plugin algorithm. The name must start with a letter, contain only alphanumeric characters and dashes, and be no longer than 100 characters. The name must not be the same as a built-in algorithm name, such as Random, TPE, Bayesian, Hyperband, or ExperimentGridSearch.

plugin_path_in_server (string, required): Full path of the user plugin algorithm. The path must exist on the dlpd service host, and the path and all of its parent directories must be readable and executable by the CLUSTERADMIN operating system user. The path must contain an optimizer.py file that implements the BasePluginOptimizer class (see section 5.2 for details).

conda_home_path (string, optional): Full path of the user conda home used to run the plugin scripts. The path must exist on the management hosts and be accessible by the CLUSTERADMIN operating system user. If not specified, DLI_CONDA_HOME is used. If specified, conda_env_name must also be specified.

conda_env_name (string, optional): Name of the user conda environment in which the plugin scripts run. If conda_home_path is specified, the conda environment under conda_home_path is used to run the plugin scripts; otherwise, DLI_CONDA_HOME is used.

enable_remote_execution (boolean, optional): Specifies the execution mode of the plugin algorithm. Valid values:
- true: enable remote execution mode. WML Accelerator HPO starts a Spark job to run the plugin scripts remotely on a compute host.
- false: disable remote execution mode. The plugin scripts run on the dlpd service host (management host).
The default value is false. Enabling remote execution mode can save management host resources, but it can increase the time it takes to start the Spark application.

log_level (string, optional): Log level for the plugin algorithm, such as INFO, DEBUG, WARN, ERROR, or FATAL. The default value is INFO.

host_name (string, required): The dlpd REST host; check with the “egosh client view DLPD_REST_BASE_URL_1” command.

dlpd_rest_port (string, required): The dlpd REST port; check with the “egosh client view DLPD_REST_BASE_URL_1” command.

 

Note:  

·        Only Python 3 is supported for plugin tuning. 

·        Make sure dependent Python modules are installed in the conda environment:

> source ${conda_home_path}/etc/profile.d/conda.sh

> conda activate ${conda_env_name}

> pip install dill==0.3.1.1 glog==0.3.5 protobuf==3.7.1

> conda deactivate
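
For example, the following request (using hypothetical values for the user, host, port, plugin name, and path; substitute your own) installs a plugin named my-random-search from a directory on the dlpd host, using the default DLI_CONDA_HOME and local execution:

> curl -k -u Admin:Admin -X POST \

     -H 'Accept: application/json' \

     -H 'Content-Type: multipart/form-data' \

     -F data='{"name": "my-random-search", "path": "/shared/hpo_plugins/my-random-search", "remoteExec": false, "logLevel": "DEBUG"}' \

     'https://mgmt01.example.com:9243/platform/rest/deeplearning/v1/hypersearch/algorithm/install'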

 

5.1.1.2 Upload Install 

Use the following REST command to submit an upload installation request (a worked example appears at the end of this section): 

> curl -k -u ${clusteruser}:${password} -X POST \

     -H 'Accept: application/json' \

     -H 'Content-Type: multipart/form-data' \

     -F file=@${plugin_tar_file_path} \

     -F data='{"name": "${plugin_algorithm_name}", "condaHome": "${conda_home_path}", "condaEnv": "${conda_env_name}", "remoteExec": ${enable_remote_execution}, "logLevel": "${log_level}"}' \ 

     'https://${host_name}:${dlpd_rest_port}/platform/rest/deeplearning/v1/hypersearch/algorithm/install'

 

Parameters:

plugin_algorithm_name (string, required): Name of the user plugin. The name must start with a letter, contain only letters, numbers, and dashes, and be no longer than 100 characters.

plugin_tar_file_path (string, required): Full path of the user plugin algorithm tar file. The tar file name must end with the “.tar” suffix.

conda_home_path (string, optional): Full path of the user conda home used to run the plugin scripts. The path must exist on the management hosts and be accessible by the CLUSTERADMIN operating system user. If not specified, the default DLI_CONDA_HOME is used. If specified, conda_env_name must also be specified.

conda_env_name (string, optional): Name of the user conda environment used to run the plugin scripts. If conda_home_path is specified, the conda environment under conda_home_path is used to run the plugin scripts; otherwise, DLI_CONDA_HOME is used.

enable_remote_execution (boolean, optional): Specifies the execution mode of the plugin algorithm. Valid values:
- true: enable remote execution mode. WML Accelerator HPO starts a Spark job to run the plugin scripts remotely on a compute host.
- false: disable remote execution mode. The plugin scripts run on the dlpd service host (management host).
The default value is false. Enabling remote execution mode can save management host resources, but it can increase the time it takes to start the Spark application.

log_level (string, optional): Log level for the plugin algorithm, such as INFO, DEBUG, WARN, ERROR, or FATAL. The default value is INFO.

host_name (string, required): The dlpd REST host; check with the “egosh client view DLPD_REST_BASE_URL_1” command.

dlpd_rest_port (string, required): The dlpd REST port; check with the “egosh client view DLPD_REST_BASE_URL_1” command.

 

Note:  

·        Only Python 3 is supported for plugin tuning. 

·        Make sure the dependent Python modules are installed in the conda environment:

> source ${conda_home_path}/etc/profile.d/conda.sh

> conda activate ${conda_env_name}

> pip install dill==0.3.1.1 glog==0.3.5 protobuf==3.7.1

> conda deactivate

·        If plugin_tar_file_path is specified and a path is also set in the data body, the path value does not take effect; the uploaded files from the location specified by plugin_tar_file_path are used.

·        How to tar your plugin scripts into a plugin tar file:

1)     Put all of your scripts under a plugin root folder, for example /tmp/your_plugin_root_folder. The folder must contain an optimizer.py file that implements the BasePluginOptimizer class (see section 5.2 for details), together with any other modules and files for the hyperparameter search plugin.

2)     Tar the plugin scripts:

> cd /tmp/your_plugin_root_folder/

> tar cf /tmp/test.tar *

3)     Replace “${plugin_tar_file_path}” in the request command with /tmp/test.tar.
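
For example, after packaging the plugin scripts as described above, a request like the following (using hypothetical values for the user, host, port, and plugin name; substitute your own) uploads /tmp/test.tar and installs it as my-random-search with remote execution enabled:

> curl -k -u Admin:Admin -X POST \

     -H 'Accept: application/json' \

     -H 'Content-Type: multipart/form-data' \

     -F file=@/tmp/test.tar \

     -F data='{"name": "my-random-search", "remoteExec": true, "logLevel": "INFO"}' \

     'https://mgmt01.example.com:9243/platform/rest/deeplearning/v1/hypersearch/algorithm/install'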

 

5.1.2 Query plugin algorithms 

 

You can query plugin algorithms by type or name.

 

5.1.2.1 Query plugins by type

Use the following REST command to query all supported tuning algorithms (an example follows the parameter list below):

> curl -k -u ${clusteruser}:${password} -X GET \

     -H 'Accept: application/json' \

     'https://${host_name}:${dlpd_rest_port}/platform/rest/deeplearning/v1/hypersearch/algorithm?type=${algo_type}'

 

Parameters:

algo_type (string, optional): The type of tuning algorithm. Specify type=USER_PLUGIN to query all user-installed plugin algorithms. Specify type=BUILD_IN to query all built-in algorithms. If the parameter is not specified, all algorithms (both built-in and user-installed) are returned.

host_name (string, required): The dlpd REST host; check with the “egosh client view DLPD_REST_BASE_URL_1” command.

dlpd_rest_port (string, required): The dlpd REST port; check with the “egosh client view DLPD_REST_BASE_URL_1” command.
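
For example, the following request (using hypothetical values for the user, host, and port; substitute your own) lists only the user-installed plugin algorithms:

> curl -k -u Admin:Admin -X GET \

     -H 'Accept: application/json' \

     'https://mgmt01.example.com:9243/platform/rest/deeplearning/v1/hypersearch/algorithm?type=USER_PLUGIN'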

 

5.1.2.2 Query plugin algorithms by name

Use the following REST command to query a plugin algorithm by name:

> curl -k -u ${clusteruser}:${password} -X GET \

     -H 'Accept: application/json' \

     'https://${host_name}:${dlpd_rest_port}/platform/rest/deeplearning/v1/hypersearch/algorithm/${plugin_algorithm_name}'

 

Parameters:

plugin_algorithm_name (string, required): Name of the user plugin algorithm.

host_name (string, required): The dlpd REST host; check with the “egosh client view DLPD_REST_BASE_URL_1” command.

dlpd_rest_port (string, required): The dlpd REST port; check with the “egosh client view DLPD_REST_BASE_URL_1” command.

 

5.1.3 Deleting a plugin 

Use the following REST command to delete a user plugin algorithm (an example follows the parameter list below):

> curl -k -u ${clusteruser}:${password} -X DELETE \

     'https://${host_name}:${dlpd_rest_port}/platform/rest/deeplearning/v1/hypersearch/algorithm/${plugin_algorithm_name}'

 

Parameters:

plugin_algorithm_name (string, required): Name of the user plugin algorithm.

host_name (string, required): The dlpd REST host; check with the “egosh client view DLPD_REST_BASE_URL_1” command.

dlpd_rest_port (string, required): The dlpd REST port; check with the “egosh client view DLPD_REST_BASE_URL_1” command.
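
For example, the following request (using hypothetical values for the user, host, port, and plugin name; substitute your own) deletes the my-random-search plugin algorithm:

> curl -k -u Admin:Admin -X DELETE \

     'https://mgmt01.example.com:9243/platform/rest/deeplearning/v1/hypersearch/algorithm/my-random-search'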

 

5.2 Developing a plugin

In this section, the random search algorithm is used as an example to demonstrate how to implement a hyperparameter search plugin. The full example code can be found in the dli-1.2.3.0_build536919_example.tar file.

 

5.2.1 Creating a plugin

1.      Create a Python file named optimizer.py, which acts as the entry point that WML Accelerator calls to run the plugin.

2.      In optimizer.py, create a class named PluginOptimizer that extends BasePluginOptimizer:

from plugins.core.logger import logger

from plugins.core.base_plugin_opt import BasePluginOptimizer

class PluginOptimizer(BasePluginOptimizer):

    def __init__(self, name, hyper_parameters, **kwargs):

        super(PluginOptimizer, self).__init__(name, hyper_parameters, **kwargs)

    def search(self, number_samples, last_exp_results):

        exp_list = []

        #### implement your algorithm logic here ###

        return exp_list

 

The class name PluginOptimizer must not be changed. All plugin classes must extend BasePluginOptimizer. It is recommended that you implement the search method and override the other functions only if necessary. The following list describes the BasePluginOptimizer functions in detail.

__init__ (optional)
Parameters: name (string): plugin optimizer name; hyper_parameters (list): hyperparameters that need to be tuned; kwargs (dict): algorithm parameters.
Returns: nothing.
Description: __init__ is called once when the plugin is initialized. The hyperparameters and algorithm parameters are defined in the task submission REST body and are passed to the __init__ function.

search (required)
Parameters: number_samples (integer): number of hyperparameter candidates requested; last_exp_results (list): the execution results of the last suggested hyperparameter sets.
Returns: hyper_params (list): suggested hyperparameter sets to run.
Description: You must implement the search function. For each search loop, the dlpd daemon calls the search function to compute the next hyperparameter candidates, and then starts a training workload for each hyperparameter set. After all of the trainings are done, the training scores (loss/accuracy) are passed to the next round of search.

get_state (optional)
Parameters: none.
Returns: state_dict (dict): the algorithm states to be saved.
Description: get_state is automatically called AFTER the search function to save the plugin algorithm's internal states. The saved states are passed to the next set_state call for algorithm status recovery (see section 5.2.4 for details).

set_state (optional)
Parameters: state_dict (dict): the algorithm states to be recovered.
Returns: nothing.
Description: set_state is automatically called BEFORE the search function to restore the plugin algorithm's internal states (see section 5.2.4 for details).

 

Note: save and restore are reserved functions used by BasePluginOptimizer to handle the save and restore logic. Please avoid using these two function names in PluginOptimizer.

 

5.2.2 Implementing the random search logic

The random search algorithm sets up a search space of hyperparameter values and selects random combinations as the next training candidates. In the search function, you need to parse the hyper_parameters parameter and search the value space of each hyperparameter. The example code uses NumPy (np), so make sure that optimizer.py imports it (import numpy as np).

 

1.      In the __init__ function, save the hyper_parameters parameter as an instance variable, _hyper_parameters, so that it can be used in the search function. Also, create an instance variable, _exp_history, to store all of the experiment history results. You do not need to save the experiment history for the random search algorithm; however, other algorithms, such as Bayesian search, require the history results to compute new experiment candidates.

    def __init__(self, name, hyper_parameters, **kwargs):

        super(PluginOptimizer, self).__init__(name, hyper_parameters, **kwargs)

        

        logger.info("all tuning hyper parameters: \n{}".format(hyper_parameters)) # get all hyper parameters that need to be tuned

        self._hyper_parameters = hyper_parameters

        self._exp_history = []

 

The format of the hyper_parameters parameter:

[

       {

           'name': 'required, string, hyperparameter name, the same name will be used in the config.json so user model can load it',

           'type': 'required, string, one of Range, Discrete',

           'dataType': 'required, string, one of INT, DOUBLE, STR',

           'minDbVal': 'double, required if type=Range and datatype=double',

           'maxDbVal': 'double, required if type=Range and datatype=double',

           'minIntVal': 'int, required if type=Range and datatype=int',

           'maxIntVal': 'int, required if type=Range and datatype=int',

           'discreteDbVal': 'double, list like [0.1, 0.2], required if type=Discrete and dataType=double',

           'discreteIntVal': 'int, list like [1, 2], required if type=Discrete and datatype=int',

           'discreateStrVal': 'string, list like ['1', '2'], required if type=Discrete and datatype=str',

           'power': 'a number value in string format, the base value for power calculation. ONLY valid when type is Range',

           'step': 'a number value in string format, step size to split the Range space. ONLY valid when type is Range',

           'userDefined': 'boolean, indicate whether the parameter is a user defined parameter or not'

       }

]

 

An example output of the above code:

all tuning hyper parameters:

[{'name': 'base_lr', 'type': 'Range', 'dataType': 'DOUBLE', 'minDbVal': 0.01, 'maxDbVal': 0.1, 'userDefined': False}]

 

2.      Implement the search function:

    def search(self, number_samples, last_exp_results):

 

        logger.info("last exps results:\n{}".format(last_exp_results))

        if not last_exp_results is None and len(last_exp_results) > 0:

            self._exp_history.extend(last_exp_results)

        

        # start random search of the hyper-parameters

        exp_list = []

        for i in range(number_samples):

            hypers = {}

            for hp in self._hyper_parameters:

                type = hp.get('type')

                if type == "Range":

                    val = self._getRandomValueFromRange(hp)

                elif type == "Discrete":

                    val = self._getRandomValueFromDiscrete(hp)

                else:

                    raise Exception("un-supported type {} for random search.".format(type))

                hypers[hp.get('name')] = val

            exp_list.append(hypers)

            

        logger.info("suggest next exps list:\n{}".format(exp_list))

        return exp_list

 

3.      Continue to implement the _getRandomValueFromRange and _getRandomValueFromDiscrete functions:

    def _getRandomValueFromRange(self, hp):

 

        data_type = hp.get('dataType')

        if data_type == "DOUBLE":

            val = hp.get('minDbVal') + np.random.rand() * (hp.get('maxDbVal') - hp.get('minDbVal'))

        elif data_type == "INT":

            val = np.random.randint(hp.get('minIntVal'), hp.get('maxIntVal'))

        else:

            raise Exception("un-supported data type {} for random range search.".format(data_type))

        

        logger.debug("next {} val: {}".format(hp.get('name'), val))

        return val 

 

    def _getRandomValueFromDiscrete(self, hp):

                

        data_type = hp.get('dataType')

        if data_type == "DOUBLE":

            vals = hp.get('discreteDbVal')

        elif data_type == "INT":

            vals = hp.get('discreteIntVal')

        else:

            vals = hp.get('discreateStrVal')

        val = vals[np.random.randint(len(vals))]

        

        logger.debug("next {} val: {}".format(hp.get('name'), val))

        return val

 

An example output of the code in steps 2 and 3, with number_samples=1:

last exps results:

[{'id': 0, 'score': 3.593962, 'hyperparameters': {'base_lr': 0.08849518263874222}}]

next base_lr val: 0.09288991388261642

suggest next exps list:

[{'base_lr': 0.09288991388261642}]

 

NOTE: The returned exp_list is a list of hyperparameter key-value dictionaries. Each hyperparameter key-value dictionary must include all hyperparameters that need to be tuned, otherwise an Exception is thrown.

 

5.2.3 Handling algorithm parameters

A tuning algorithm can have parameters that users specify when submitting each tuning task. To demonstrate this, a random_seed parameter is added to the random search algorithm.

 

1.      Specify the random_seed parameter when submitting a task by adding the following configuration to the algoDef part of the REST body:

            

"algoDef": {

…,

"algoParams": [{

"name": "random_seed",

"value": 2

            }]

      }

 

2.      Parse the algorithm parameters in the __init__ function:

 

    def __init__(self, name, hyper_parameters, **kwargs):

        super(PluginOptimizer, self).__init__(name, hyper_parameters, **kwargs)

        

        logger.debug("all tuning hyper parameters: \n{}".format(hyper_parameters)) # get all hyper parameters that need to be tuned

        self._hyper_parameters = hyper_parameters

        self._exp_history = []

 

  # get all optimizer search parameters that user passed

        logger.info("all optimizer search parameters: \n{}".format(kwargs))

 

        # get optimizer parameters, the parameters value is string

        if kwargs.get('random_seed'):

            self._random_seed = int(kwargs.get('random_seed'))

            np.random.seed(self._random_seed)

   

An example output of the added code:

all optimizer search parameters:

{'random_seed': '2'}                                      

 

5.2.4 Handling algorithm internal states

A user tuning algorithm can have internal variables that are shared and reused between search loops. To demonstrate this ability, the random search algorithm in this example reuses the random state between search calls. Each time search begins, it first recovers the last random state and then performs the random search based on that state. If random_seed was set previously, the proposed random hyperparameter sequence is expected to be the same.

To achieve this, define get_state and set_state functions to implement the save and restore behavior for the algorithm states. The get_state function is automatically called AFTER the search function to save the plugin algorithm's internal states. If there are previously saved states (that is, it is not the first round of search), the set_state function is automatically called BEFORE the search function to restore the plugin algorithm's internal states.

The following is an example implementation of the get_state and set_state functions. In the get_state function, you only need to return a key-value dictionary that includes all states to be saved; the hyperparameter plugin module handles the remaining state persistence logic. In the set_state function, a previously saved state dictionary is passed in, and you use it to recover the algorithm status.

 

    def get_state(self):

        return {'rng_state': np.random.get_state()}

    

 

    def set_state(self, state_dict):

        np.random.set_state(state_dict.get('rng_state'))

 

 

5.2.5 Debugging the plugin algorithm

 

5.2.5.1 Check plugin algorithm execution logs

1.      We recommend that you use the plugins.core.logger module to print logs in the plugin algorithm code. Using this module ensures that the logs are written to the intended location. 

2.      If you specify logLevel when installing the plugin algorithm (section 5.1.1), that setting is used. Otherwise, the INFO log level is used.

3.      Make sure the dependent Python module glog is installed in your conda environment.

4.      To use plugins.core.logger:

from plugins.core.logger import logger

 

logger.info("This is the INFO log.")

logger.debug("This is the DEBUG log.")

 

5.      Check the plugin algorithm execution logs:

·        If remote execution mode is disabled, check the plugin algorithm logs in ${EGO_TOP}/dli/${DLI_VERSION}/dlpd/logs/dlpd.log.

·        If remote execution mode is enabled, check the plugin algorithm logs in the corresponding Spark application log under ${SPARK_HOME}/logs/<appID>, or from the cluster management console:

i.     Log on to the cluster management console, select Workload > Instance Group, and click the instance group name.

ii.    In the Applications tab, click the application that runs the plugin.

iii.   Click the Drivers and Executors tab, and download the executor stderr log.

 

5.2.5.2 Debug plugin algorithm code without submitting HPO task

You can also develop and debug a plugin without installing it or submitting tasks.

1.      Create a debug_work_directory folder as your root debug working directory, and then put your plugin algorithm scripts under it as shown below. The algo_name folder is your root plugin algorithm directory.

${debug_work_directory}

|-- ${algo_name}

    |-- optimizer.py

 

2.      To submit the debug request, send the following request together with your HPO task submission body. This creates a task_attr.pb file under debug_work_directory for debugging.

> curl -k -u ${clusteruser}:${password} -X POST \

     -H 'Accept: application/json' \

     -H 'Content-Type: application/json' \

     'https://${host_name}:${dlpd_rest_port}/platform/rest/deeplearning/v1/hypersearch/algorithm/debug?workdir=${debug_work_directory}' \

     --data '{"hpoName": "hpo-task-name" ...}'   # full HPO task submit request body including hyperParams settings

 

Parameters:

debug_work_directory (required): The debug working directory. If not specified, the task_attr.pb file is generated under /tmp.

host_name (required): The dlpd REST host; check with the “egosh client view DLPD_REST_BASE_URL_1” command.

dlpd_rest_port (required): The dlpd REST port; check with the “egosh client view DLPD_REST_BASE_URL_1” command.

 

3.      Source the conda environment, and then run the following commands to debug your plugin algorithm scripts:

> export PYTHONPATH=${debug_work_directory}:$PYTHONPATH

> export PYTHONPATH=${DLI_SHARE_TOP}/tools/tune:$PYTHONPATH

> export PYTHONPATH=${DLI_SHARE_TOP}/tools/plugins/core:$PYTHONPATH

> export HPO_PLUGIN_LOG_LEVEL=DEBUG

> python ${DLI_SHARE_TOP}/tools/tune/plugins/plugin_launcher.py --attribute_file ${debug_work_directory}/task_attr.pb --number_samples 1 --output_file ${debug_work_directory}/new_exps.pb --algorithm_name ${algo_name} --work_dir ${debug_work_directory}

 

An example output of the command:

D1219 18:25:06.253604 9944 glog.py:56] Log level set to 10

D1219 18:25:06.438748 9944 plugin_launcher.py:42] start to load plugin module, plugin opt name: random_plugin_example

D1219 18:25:06.502085 9944 glog.py:56] Log level set to 10

D1219 18:25:06.510874 9944 glog.py:56] Log level set to 10

D1219 18:25:06.511296 9944 plugin_launcher.py:73] start to restore the plugin opt

D1219 18:25:06.511518 9944 plugin_launcher.py:75] finish restoring plugin opt

D1219 18:25:06.511605 9944 plugin_launcher.py:123] exp history size: 0

D1219 18:25:06.511668 9944 optimizer.py:59] last exps results:

[]

D1219 18:25:06.511782 9944 optimizer.py:101] next base_lr val: 0.04783310218787402

D1219 18:25:06.511841 9944 optimizer.py:78] suggest next exps list:

[{'base_lr': 0.04783310218787402}]

 

NOTE: 

·       The algo_name parameter in the command must be the same as the name of the plugin algorithm folder under debug_work_directory.

·       If you experience the “UnboundLocalError: local variable 'opt' referenced before assignment” error when executing the command, remove the task_attr.pb file and rerun step 2. 

 

6.   User-defined hyperparameter experiments (RFE 137804)

 

6.1 Submit user-generated hyperparameters as an experiment

 

1.      Use the following REST command to submit user-generated hyperparameters as an experiment:

POST: /platform/rest/deeplearning/v1/hypersearch

 

> curl -k -u ${clusteruser}:${password} -X POST \

     -H 'Accept: application/json' \

     -H 'Content-Type: application/json' \

     --data '<your model hyperparameter body>' \

     'https://${host_name}:${dlpd_rest_port}/platform/rest/deeplearning/v1/hypersearch'

 

2.      Add experiments when submitting a task by specifying the search algorithm 'ExperimentGridSearch' and adding experiments to the REST body in the following format (a complete example follows the note below): 

 

   'algoDef':

       {

          'algorithm': 'ExperimentGridSearch',      

          'maxJobNum': 1,

       },

   'experiments':[ 

       {

              'id': 1,

              "hyperParams": [

                    { "dataType": "double",

                      "fixedVal": "0.01",

                      "name": "base_lr"}

               ]

       },

       …

Note: The maxJobNum value set by the user is reset to the number of experiments if maxJobNum is larger than that number or is set to -1.
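
Putting the pieces together, the following is a minimal sketch of the algoDef and experiments sections of such a request. The hyperparameter values are hypothetical, and "..." stands for the remaining fields of your usual HPO task submission body, which are unchanged:

{
    ...,
    "algoDef": {
        "algorithm": "ExperimentGridSearch",
        "maxJobNum": 2
    },
    "experiments": [
        {
            "id": 1,
            "hyperParams": [
                { "dataType": "double", "fixedVal": "0.01", "name": "base_lr" }
            ]
        },
        {
            "id": 2,
            "hyperParams": [
                { "dataType": "double", "fixedVal": "0.05", "name": "base_lr" }
            ]
        }
    ]
}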

 

6.2 Duplicate hyperparameters in a task are ignored

For the built-in search algorithms, duplicate experiments with the same hyperparameters are ignored. This has the following implications:

1.     If the whole search space is discrete, the maxJobNum value set by the user is reset to the size of the discrete space if maxJobNum is larger than that size or is set to -1. Hyperband is a special case: its maxJobNum is kept the same, but Hyperband stops if all of the hyperparameter combinations have been run with the resource value specified in the input.

2.     Because duplicate experiments are ignored, the search algorithm keeps proposing experiments until it meets the stop condition. The cluster-level parameter HPO_MAX_REPEATE_EXPERIMENT_NUM can be set in $EGO_TOP/dli/conf/dlpd/dlpd.conf to control the number of duplicate experiments that the search algorithm is allowed to propose. The default value of -1 means that the algorithm keeps proposing until it meets the stop condition (see the example after the note below).

Note: User plugin search algorithms are not currently controlled by this parameter, which means that a plugin search algorithm must handle the duplicate experiments scenario itself.
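
For example, to allow the search algorithm to propose at most 10 duplicate experiments, add or update the following entry in $EGO_TOP/dli/conf/dlpd/dlpd.conf (this sketch assumes the file uses KEY=VALUE entries; follow the format of the existing entries in your file), and then restart the dlpd service:

HPO_MAX_REPEATE_EXPERIMENT_NUM=10

> egosh service stop dlpd

> egosh service start dlpd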

 

7. EDT worker logger callback to handle test metric

A worker_logger callback is added to handle an EDT job's test metrics. To use this enhancement, pass the worker_logger parameter for the EDT job when initializing the FabricModel class.

      Example:

      edt_m = FabricModel(model, getDatasets, F.cross_entropy, optimizer, driver_logger=EDTLoggerCallback(), worker_logger=EDTLoggerCallback())

 

8.   List of files 

 

dlicore-1.2.3.0_<arch>_build536919.tar.gz:

$EGO_TOP/dli/1.2.3/dlpd/lib/cws_dl-core-1.2.3.jar

$EGO_TOP/dli/1.2.3/dlpd/lib/cws_dl-common-1.2.3.jar

$EGO_TOP/dli/conf/dlpd/hpo_algo.conf

$EGO_TOP/integration/elk/conf/grok-pattern/dlinsights-pattern

$EGO_TOP/integration/elk/conf/indexcleanup/dli.conf

$EGO_TOP/integration/elk/conf/indexer/dlinsights_logstash_wml.conf

$EGO_TOP/integration/elk/conf/indexer/cws_spark.conf

$EGO_TOP/integration/elk/conf/shipper/conductor.yml

$EGO_TOP/integration/elk/init/template/ibm-dlinsights-batch-job-metrics.json

$EGO_TOP/integration/elk/init/template/ibm-dlinsights-batch-job-resource-usage

$EGO_TOP/integration/elk/init/template/ibm-dlinsights-spark-driver-work.json

$EGO_TOP/integration/elk/init/template/ibm-dlinsights-spark-executor-work.json

$EGO_TOP/integration/elk/init/template/ibm-dlinsights.json

$EGO_TOP/dli/1.2.3/dlinsights/monitor/app/main/algorithm.py

$EGO_TOP/dli/1.2.3/dlinsights/monitor/app/main/app_list.py

$EGO_TOP/dli/1.2.3/dlinsights/monitor/app/main/commons.py

$EGO_TOP/dli/1.2.3/dlpd/bin/start-dlpd.sh 

 

dlimgmt-1.2.3.0_<arch>_build536919.tar.gz:

$EGO_TOP/wlp/usr/servers/gui/apps/dli/1.2.3/dlgui/dl/js/modelTrainingList.controller.js

$EGO_TOP/wlp/usr/servers/gui/apps/dli/1.2.3/dlgui/dl/applicationViewMAO.html

 

 

dli-1.2.3.0_build536919_share.tar.gz:

$DLI_SHARED_FS/tools/tune/plugins/plugin_launcher.py

$DLI_SHARED_FS/tools/tune/plugins/__init__.py

$DLI_SHARED_FS/tools/tune/plugins/core/__init__.py

$DLI_SHARED_FS/tools/tune/plugins/core/base_plugin_opt.py

$DLI_SHARED_FS/tools/tune/plugins/core/plugin_algo_attributes_pb2.py

$DLI_SHARED_FS/tools/tune/plugins/core/plugin_task_parameter_pb2.py

$DLI_SHARED_FS/tools/tune/plugins/core/logger.py

$DLI_SHARED_FS/tools/tune/bayes_opt_crl.py

$DLI_SHARED_FS/tools/tune/util/bayesian_opt_utils.py

$DLI_SHARED_FS/tools/tune/util/sobol_lib.py

$DLI_SHARED_FS/conf/model_files_control_util.sh

$DLI_SHARED_FS/conf/spark-env.sh

$DLI_SHARED_FS/tools/dl_plugins/common_wrapper.sh

$DLI_SHARED_FS/tools/dl_plugins/dlioptgen.py

$DLI_SHARED_FS/tools/dl_plugins/shipper.yml

$DLI_SHARED_FS/tools/dl_plugins/shipper.sh

$DLI_SHARED_FS/tools/dl_plugins/edtPyTorchCmdGen.py

$DLI_SHARED_FS/fabric/1.2.3/libs/fabric.zip

 

dli-1.2.3.0_build536919_example.tar

optimizer.py

 

9.   Product notifications

To receive information about product solution and patch updates automatically, subscribe to product notifications on the My Notifications page (http://www.ibm.com/support/mynotifications/) on the IBM Support website (http://support.ibm.com). You can edit your subscription settings to choose the types of information that you want to be notified about, for example, security bulletins, fixes, troubleshooting, and product enhancements or documentation changes. 

 

10.   Copyright and trademark information 

© Copyright IBM Corporation 2020

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.