Readme file for IBM Watson® Machine Learning Accelerator Interim Fix 517129  

 

Readme file for: IBM Watson® Machine Learning Accelerator
Product/Component Release: 1.2.0
Fix ID: dli-1.2.2-build517129-wmla

Publication date: April 26, 2019

 

This interim fix resolves the following issues in the IBM Spectrum Conductor Deep Learning 1.2.2 component in IBM Watson Machine Learning Accelerator 1.2.0:

·       When you run elastic distributed training on a PyTorch model, the training cannot finish if the batch job was created using NFSv4.

·       When you run native distributed training or elastic distributed training on a TensorFlow model, the executors fail to initialize, an executor task fails, or the training data cannot be saved.

·       Submitting more than 5 worker tasks for elastic distributed training can fail.

·       When you use framework plugins, submitting more than 100 jobs at one time can fail, and task status cannot be retrieved when a driver is in an error state.

Contents

1.     Download location

2.     Products or components affected

3.     Installation and configuration

4.     Uninstallation

5.     List of files

6.     Product notifications

7.     Copyright and trademark information

1.   Download location

Download interim fix 517129 from the following location: https://www.ibm.com/eserver/support/fixes/

2.   Products or components affected

Component name: dlpd, fabric, dl_plugins, plc, WEBGUI, REST, ascd, elasticsearch

Platform: Linux x86_64, Linux ppc64le

Fix ID: dli-1.2.2-build517129

3.   Installation and configuration

Follow the instructions in this section to download and install this interim fix on Linux hosts in your cluster.

a.     Log on to the master host as CLUSTERADMIN and source the environment.

b.     Create a backup directory and back up the following files:

$DLI_SHARED_FS/tools/dl_plugins/common_wrapper.sh

$DLI_SHARED_FS/tools/dl_plugins/dlioptgen.py

$DLI_SHARED_FS/tools/dl_plugins/tensorflow_wrapper.sh

$DLI_SHARED_FS/fabric/1.2.2/libs/fabric.zip
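
The backup in step b can be scripted. The following is a minimal sketch, not a product command: the backup_dli_files function name and the example backup path are assumptions made for illustration. It copies the four files listed above into a backup directory and keeps their relative paths, so that they can later be copied back to the same locations.

```shell
#!/bin/sh
# Sketch only: backup_dli_files is a hypothetical helper, not part of the product.
#   $1 = shared file system root (normally $DLI_SHARED_FS)
#   $2 = backup directory (any location you choose)
backup_dli_files() {
    src_root=$1
    backup_dir=$2
    for rel in \
        tools/dl_plugins/common_wrapper.sh \
        tools/dl_plugins/dlioptgen.py \
        tools/dl_plugins/tensorflow_wrapper.sh \
        fabric/1.2.2/libs/fabric.zip
    do
        # Recreate the relative directory layout inside the backup directory.
        mkdir -p "$backup_dir/$(dirname "$rel")" || return 1
        # -p preserves the file mode and timestamps.
        cp -p "$src_root/$rel" "$backup_dir/$rel" || return 1
    done
}

# Example call (the backup path is an assumption):
# backup_dli_files "$DLI_SHARED_FS" /tmp/dli_backup_517129
```

The function stops at the first file that cannot be copied and returns a nonzero status, so a missing file is noticed before the fix is installed.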

 

c.     Stop the following services:

$ egosh service stop dlpd

$ egosh service stop plc WEBGUI REST ascd

$ egosh service stop elk-elasticsearch elk-elasticsearch-master elk-elasticsearch-data

 

d.     On each management host, download the packages to a directory, for example /dlifixes.

e.     Run the egoinstallfixes command to install the cluster JAR files. On Linux x86_64:

$ egoinstallfixes /dlifixes/dlicore-1.2.2.0_x86_64_build517129.tar.gz

$ egoinstallfixes /dlifixes/egomgmt-3.7.0.1_noarch_build517129.tar.gz

$ egoinstallfixes /dlifixes/ascd-2.3.0.0_noarch_build517129.tar.gz

$ egoinstallfixes /dlifixes/egorest-3.7.0.1_noarch_build517129.tar.gz

$ egoinstallfixes /dlifixes/egoelastic-1.4.1.1_x86_64_build517129.tar.gz

 

To run on Linux ppc64le:

$ egoinstallfixes /dlifixes/dlicore-1.2.2.0_ppc64le_build517129.tar.gz

$ egoinstallfixes /dlifixes/egomgmt-3.7.0.1_noarch_build517129.tar.gz

$ egoinstallfixes /dlifixes/ascd-2.3.0.0_noarch_build517129.tar.gz

$ egoinstallfixes /dlifixes/egorest-3.7.0.1_noarch_build517129.tar.gz

$ egoinstallfixes /dlifixes/egoelastic-1.4.1.1_ppc64le_build517129.tar.gz

 

NOTE: Running the “egoinstallfixes” command automatically backs up the current binary files to a fix backup directory for recovery purposes. Do not delete this backup directory; you will need it if you want to recover the original files. For more information on using this command, see the egoinstallfixes command reference.
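
The two platform-specific command lists in step e differ only in the architecture string of the dlicore and egoelastic packages. The following sketch derives the package list from the host architecture. The fix_packages function is a hypothetical helper that only prints the file names given in this readme; it assumes that uname -m reports x86_64 or ppc64le on the supported hosts.

```shell
#!/bin/sh
# Sketch only: fix_packages is a hypothetical helper, not part of the product.
#   $1 = architecture (optional; defaults to the output of uname -m)
fix_packages() {
    arch=${1:-$(uname -m)}
    case "$arch" in
        x86_64|ppc64le) ;;
        *) echo "unsupported architecture: $arch" >&2; return 1 ;;
    esac
    # Same order as the commands in step e.
    printf '%s\n' \
        "dlicore-1.2.2.0_${arch}_build517129.tar.gz" \
        "egomgmt-3.7.0.1_noarch_build517129.tar.gz" \
        "ascd-2.3.0.0_noarch_build517129.tar.gz" \
        "egorest-3.7.0.1_noarch_build517129.tar.gz" \
        "egoelastic-1.4.1.1_${arch}_build517129.tar.gz"
}

# Example use, assuming the packages were downloaded to /dlifixes:
# for pkg in $(fix_packages); do
#     egoinstallfixes "/dlifixes/$pkg" || break
# done
```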

 

f.      Run the pversions command to verify the installation:

$ pversions -b 517129

 

g.     Start the following services:

$ egosh service start dlpd

$ egosh service start plc WEBGUI REST ascd

$ egosh service start elk-elasticsearch elk-elasticsearch-master elk-elasticsearch-data

 

h.     Log on to the master host as CLUSTERADMIN, download the dli-1.2.2.0_build517129_share.tar.gz package and extract its contents to the top-level $DLI_SHARED_FS directory:

$ tar zoxf dli-1.2.2.0_build517129_share.tar.gz -C $DLI_SHARED_FS

 

i.       Make sure that the files extracted from the dli-1.2.2.0_build517129_share.tar.gz package have the correct permissions:

$ chmod 775 $DLI_SHARED_FS/fabric/1.2.2/libs/fabric.zip

$ chmod 755 $DLI_SHARED_FS/tools/dl_plugins/common_wrapper.sh

$ chmod 755 $DLI_SHARED_FS/tools/dl_plugins/dlioptgen.py

$ chmod 755 $DLI_SHARED_FS/tools/dl_plugins/tensorflow_wrapper.sh
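
After the chmod commands in step i, the resulting modes can be checked rather than assumed. The following is a minimal sketch: check_patch_modes is a hypothetical helper, and it relies on the GNU stat -c option that is available on Linux. It compares each patched file against the mode set above and reports any mismatch.

```shell
#!/bin/sh
# Sketch only: check_patch_modes is a hypothetical helper, not part of the product.
#   $1 = shared file system root (normally $DLI_SHARED_FS)
# Prints one line per file whose mode differs; returns nonzero on any mismatch.
check_patch_modes() {
    root=$1
    rc=0
    while read -r want rel; do
        # GNU stat prints the octal mode; a missing file is reported as "missing".
        have=$(stat -c '%a' "$root/$rel" 2>/dev/null) || have=missing
        if [ "$have" != "$want" ]; then
            echo "$rel: expected $want, found $have"
            rc=1
        fi
    done <<EOF
775 fabric/1.2.2/libs/fabric.zip
755 tools/dl_plugins/common_wrapper.sh
755 tools/dl_plugins/dlioptgen.py
755 tools/dl_plugins/tensorflow_wrapper.sh
EOF
    return $rc
}

# Example call:
# check_patch_modes "$DLI_SHARED_FS"
```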

4.   Uninstallation

If required, follow the instructions in this section to uninstall this interim fix on hosts in your cluster.

a.      Log on to the management host as CLUSTERADMIN and source the environment.

b.      Stop the following services:

$ egosh service stop dlpd

$ egosh service stop plc WEBGUI REST ascd

$ egosh service stop elk-elasticsearch elk-elasticsearch-master elk-elasticsearch-data

c.       Log on to each management host in the cluster and roll back this interim fix:

$ egoinstallfixes -r 517129

d.      Restore the files that you backed up during installation under $DLI_SHARED_FS.

e.      Start the following services:

$ egosh service start dlpd

$ egosh service start plc WEBGUI REST ascd

$ egosh service start elk-elasticsearch elk-elasticsearch-master elk-elasticsearch-data
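
The restore in step d can be scripted in the same way as the backup. The following is a minimal sketch: restore_dli_files is a hypothetical helper, and it assumes that the backup directory mirrors the relative paths of the files under $DLI_SHARED_FS. It checks that all four backup copies exist before it copies anything, so that a partial restore is avoided.

```shell
#!/bin/sh
# Sketch only: restore_dli_files is a hypothetical helper, not part of the product.
# Assumes the backup directory mirrors the relative paths under $DLI_SHARED_FS.
#   $1 = backup directory
#   $2 = shared file system root (normally $DLI_SHARED_FS)
restore_dli_files() {
    backup_dir=$1
    dst_root=$2
    files="tools/dl_plugins/common_wrapper.sh
tools/dl_plugins/dlioptgen.py
tools/dl_plugins/tensorflow_wrapper.sh
fabric/1.2.2/libs/fabric.zip"
    # First pass: refuse to start if any backup copy is missing.
    for rel in $files; do
        if [ ! -f "$backup_dir/$rel" ]; then
            echo "missing backup: $backup_dir/$rel" >&2
            return 1
        fi
    done
    # Second pass: copy the originals back, preserving mode and timestamps.
    for rel in $files; do
        cp -p "$backup_dir/$rel" "$dst_root/$rel" || return 1
    done
}

# Example call (the backup path is an assumption):
# restore_dli_files /tmp/dli_backup_517129 "$DLI_SHARED_FS"
```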

5.   List of files

$EGO_TOP/dli/1.2.2/dlpd/lib/cws_dl-core-1.2.2.jar

$EGO_TOP/perf/ego/3.7/lib/commons-ego.jar

$EGO_TOP/gui/3.7/lib/commons-ego.jar

$EGO_TOP/ascd/2.3.0/lib/commons-ego.jar

$EGO_TOP/wlp/usr/shared/resources/rest/3.7/commons-ego.jar

$EGO_TOP/integration/elk/1.4.1/elasticsearch-5.4.2/plugins/search-guard-5/commons-ego.jar

$DLI_SHARED_FS/tools/dl_plugins/common_wrapper.sh

$DLI_SHARED_FS/tools/dl_plugins/dlioptgen.py

$DLI_SHARED_FS/tools/dl_plugins/tensorflow_wrapper.sh

$DLI_SHARED_FS/fabric/1.2.2/libs/fabric.zip

6.   Product notifications

To receive information about product solution and patch updates automatically, subscribe to product notifications on the My Notifications page (http://www.ibm.com/support/mynotifications/) on the IBM Support website (http://support.ibm.com). You can edit your subscription settings to choose the types of information that you want to be notified about, for example, security bulletins, fixes, troubleshooting, and product enhancements or documentation changes.

7.   Copyright and trademark information

© Copyright IBM Corporation 2019

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml