Readme file for IBM Watson® Machine Learning Accelerator Interim Fix 517129
Readme file for: IBM Watson® Machine Learning Accelerator
Product/Component Release: 1.2.0
Fix ID: dli-1.2.2-build517129-wmla
Publication date: April 26, 2019
This interim fix resolves the following issues in the IBM Spectrum Conductor Deep Learning 1.2.2 component in IBM Watson Machine Learning Accelerator 1.2.0:
· When running elastic distributed training on a PyTorch model, the training cannot finish if the batch job was created using NFSv4.
· When running native distributed training or elastic distributed training on a TensorFlow model, the executors fail to initialize, an executor task fails, or the training data cannot be saved.
· When submitting more than 5 worker tasks for elastic distributed training, failure can occur.
· When using framework plugins, failure can occur if more than 100 jobs are submitted at one time, or task status cannot be retrieved when a driver is in an error state.
1. Download location
2. Products or components affected
3. Installation and configuration
4. Uninstallation
5. List of files
6. Product notifications
7. Copyright and trademark information
1. Download location
Download interim fix 517129 from the following location: https://www.ibm.com/eserver/support/fixes/
2. Products or components affected
Component Name: dlpd, fabric, dl_plugins, plc, WEBGUI, REST, ascd, elasticsearch
Platform: Linux x86_64, Linux ppc64le
Fix ID: dli-1.2.2-build517129
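The package names used in the installation steps differ between the two supported platforms only in the architecture suffix of the dlicore and egoelastic packages. As a minimal sketch, the list for a given host could be derived from the output of `uname -m`. The function name `dli_fix_packages` is hypothetical and is not part of the product or the fix.

```shell
# Hypothetical helper (not shipped with the fix): print the package file
# names from this readme that apply to a machine architecture.
# With no argument, the architecture reported by `uname -m` is used.
dli_fix_packages() {
    arch="${1:-$(uname -m)}"
    case "$arch" in
        x86_64|ppc64le) ;;
        *)
            echo "unsupported architecture: $arch" >&2
            return 1
            ;;
    esac
    # dlicore and egoelastic are architecture specific; the rest are noarch.
    printf '%s\n' \
        "dlicore-1.2.2.0_${arch}_build517129.tar.gz" \
        "egomgmt-3.7.0.1_noarch_build517129.tar.gz" \
        "ascd-2.3.0.0_noarch_build517129.tar.gz" \
        "egorest-3.7.0.1_noarch_build517129.tar.gz" \
        "egoelastic-1.4.1.1_${arch}_build517129.tar.gz"
}
```

For example, `dli_fix_packages ppc64le` prints the five package names for Linux ppc64le, one per line, and an architecture other than the two supported ones returns a non-zero status.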
3. Installation and configuration
Follow the instructions in this section to download and install this interim fix on Linux hosts in your cluster.
a. Log on to the master host as CLUSTERADMIN and source the environment.
b. Create a backup directory and back up the following files:
$DLI_SHARED_FS/tools/dl_plugins/common_wrapper.sh
$DLI_SHARED_FS/tools/dl_plugins/dlioptgen.py
$DLI_SHARED_FS/tools/dl_plugins/tensorflow_wrapper.sh
$DLI_SHARED_FS/fabric/1.2.2/libs/fabric.zip
c. Stop the following services:
$ egosh service stop dlpd
$ egosh service stop plc WEBGUI REST ascd
$ egosh service stop elk-elasticsearch elk-elasticsearch-master elk-elasticsearch-data
d. On each management host, download the packages to a directory. For example, packages can be downloaded to the /dlifixes directory.
e. Run the egoinstallfixes command to install the cluster JAR files. To run on Linux x86_64:
$ egoinstallfixes /dlifixes/dlicore-1.2.2.0_x86_64_build517129.tar.gz
$ egoinstallfixes /dlifixes/egomgmt-3.7.0.1_noarch_build517129.tar.gz
$ egoinstallfixes /dlifixes/ascd-2.3.0.0_noarch_build517129.tar.gz
$ egoinstallfixes /dlifixes/egorest-3.7.0.1_noarch_build517129.tar.gz
$ egoinstallfixes /dlifixes/egoelastic-1.4.1.1_x86_64_build517129.tar.gz
To run on Linux ppc64le:
$ egoinstallfixes /dlifixes/dlicore-1.2.2.0_ppc64le_build517129.tar.gz
$ egoinstallfixes /dlifixes/egomgmt-3.7.0.1_noarch_build517129.tar.gz
$ egoinstallfixes /dlifixes/ascd-2.3.0.0_noarch_build517129.tar.gz
$ egoinstallfixes /dlifixes/egorest-3.7.0.1_noarch_build517129.tar.gz
$ egoinstallfixes /dlifixes/egoelastic-1.4.1.1_ppc64le_build517129.tar.gz
NOTE: Running the “egoinstallfixes” command automatically backs up the current
binary files to a fix backup directory for recovery purposes. Do not delete
this backup directory; you will need it if you want to recover the original
files. For more information on using this command, see the egoinstallfixes command reference.
f. Run the pversions command to verify the installation:
$ pversions -b 517129
g. Start the following services:
$ egosh service start dlpd
$ egosh service start plc WEBGUI REST ascd
$ egosh service start elk-elasticsearch elk-elasticsearch-master elk-elasticsearch-data
h. Log on to the master host as CLUSTERADMIN, download the dli-1.2.2.0_build517129_share.tar.gz package, and extract its contents to the top-level $DLI_SHARED_FS directory:
$ tar zoxf dli-1.2.2.0_build517129_share.tar.gz -C $DLI_SHARED_FS
i. Make sure that the files extracted from the dli-1.2.2.0_build517129_share.tar.gz package have the proper permissions:
$ chmod 775 $DLI_SHARED_FS/fabric/1.2.2/libs/fabric.zip
$ chmod 755 $DLI_SHARED_FS/tools/dl_plugins/common_wrapper.sh
$ chmod 755 $DLI_SHARED_FS/tools/dl_plugins/dlioptgen.py
$ chmod 755 $DLI_SHARED_FS/tools/dl_plugins/tensorflow_wrapper.sh
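Step b of the installation (backing up the four shared files) can be scripted. The following is a minimal sketch, assuming that DLI_SHARED_FS is set by the sourced environment. The function name `backup_dli_shared_files`, the backup directory argument, and the choice to mirror the paths relative to $DLI_SHARED_FS inside the backup directory are illustrative and are not prescribed by this readme.

```shell
# Hypothetical helper: copy the four shared files named in step b into a
# backup directory, keeping their paths relative to $DLI_SHARED_FS.
backup_dli_shared_files() {
    backup_dir="$1"
    if [ -z "$backup_dir" ]; then
        echo "usage: backup_dli_shared_files <backup-directory>" >&2
        return 2
    fi
    if [ -z "${DLI_SHARED_FS:-}" ]; then
        echo "DLI_SHARED_FS is not set; source the environment first" >&2
        return 2
    fi
    for rel in \
        tools/dl_plugins/common_wrapper.sh \
        tools/dl_plugins/dlioptgen.py \
        tools/dl_plugins/tensorflow_wrapper.sh \
        fabric/1.2.2/libs/fabric.zip
    do
        src="$DLI_SHARED_FS/$rel"
        dest="$backup_dir/$rel"
        # Stop at the first missing file so an incomplete backup is noticed.
        if [ ! -f "$src" ]; then
            echo "missing file, not backed up: $src" >&2
            return 1
        fi
        mkdir -p "$(dirname "$dest")" || return 1
        # -p preserves mode and timestamps of the original file.
        cp -p "$src" "$dest" || return 1
    done
}
```

A call such as `backup_dli_shared_files /tmp/dli_backup_517129` (the path is only an example) would be run before the services are stopped in step c.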
4. Uninstallation
If required, follow the instructions in this section to uninstall this interim fix on hosts in your cluster.
a. Log in to the management host as CLUSTERADMIN and source the environment.
b. Stop the following services:
$ egosh service stop dlpd
$ egosh service stop plc WEBGUI REST ascd
$ egosh service stop elk-elasticsearch elk-elasticsearch-master elk-elasticsearch-data
c. Log on to each management host in the cluster and roll back this interim fix:
$ egoinstallfixes -r 517129
d. Restore the files that you backed up during installation under $DLI_SHARED_FS.
e. Start the following services:
$ egosh service start dlpd
$ egosh service start plc WEBGUI REST ascd
$ egosh service start elk-elasticsearch elk-elasticsearch-master elk-elasticsearch-data
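Step d of the uninstallation (restoring the backed-up files) can be sketched as follows. This is only an illustration: it assumes that the backup directory mirrors the file paths relative to $DLI_SHARED_FS, which this readme does not prescribe, and the function name `restore_dli_shared_files` is hypothetical.

```shell
# Hypothetical helper: copy the four backed-up files back under
# $DLI_SHARED_FS. Assumes <backup-directory>/<relative path> layout.
restore_dli_shared_files() {
    backup_dir="$1"
    if [ -z "$backup_dir" ] || [ -z "${DLI_SHARED_FS:-}" ]; then
        echo "usage: restore_dli_shared_files <backup-directory> (DLI_SHARED_FS must be set)" >&2
        return 2
    fi
    files="tools/dl_plugins/common_wrapper.sh
tools/dl_plugins/dlioptgen.py
tools/dl_plugins/tensorflow_wrapper.sh
fabric/1.2.2/libs/fabric.zip"
    # First pass: refuse to start if any backup copy is missing, so that
    # the shared directory is never left partially restored.
    for rel in $files; do
        if [ ! -f "$backup_dir/$rel" ]; then
            echo "missing backup copy: $backup_dir/$rel" >&2
            return 1
        fi
    done
    # Second pass: copy each file back, preserving mode and timestamps.
    for rel in $files; do
        mkdir -p "$(dirname "$DLI_SHARED_FS/$rel")" || return 1
        cp -p "$backup_dir/$rel" "$DLI_SHARED_FS/$rel" || return 1
    done
}
```

Checking all four backup copies before copying any of them is a design choice of this sketch; it avoids mixing original and patched files if the backup is incomplete.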
5. List of files
$EGO_TOP/dli/1.2.2/dlpd/lib/cws_dl-core-1.2.2.jar
$EGO_TOP/perf/ego/3.7/lib/commons-ego.jar
$EGO_TOP/gui/3.7/lib/commons-ego.jar
$EGO_TOP/ascd/2.3.0/lib/commons-ego.jar
$EGO_TOP/wlp/usr/shared/resources/rest/3.7/commons-ego.jar
$EGO_TOP/integration/elk/1.4.1/elasticsearch-5.4.2/plugins/search-guard-5/commons-ego.jar
$DLI_SHARED_FS/tools/dl_plugins/common_wrapper.sh
$DLI_SHARED_FS/tools/dl_plugins/dlioptgen.py
$DLI_SHARED_FS/tools/dl_plugins/tensorflow_wrapper.sh
$DLI_SHARED_FS/fabric/1.2.2/libs/fabric.zip
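As a quick sanity check after installation, the presence of the files listed above can be tested from a host where both $EGO_TOP and $DLI_SHARED_FS are visible. The sketch below is illustrative only; the function names `fix_file_list` and `check_fix_files` are hypothetical, and the check confirms only that each file exists, not that it is the patched version (use pversions for that).

```shell
# Hypothetical helper: print the ten file paths from the list of files.
fix_file_list() {
    printf '%s\n' \
        "$EGO_TOP/dli/1.2.2/dlpd/lib/cws_dl-core-1.2.2.jar" \
        "$EGO_TOP/perf/ego/3.7/lib/commons-ego.jar" \
        "$EGO_TOP/gui/3.7/lib/commons-ego.jar" \
        "$EGO_TOP/ascd/2.3.0/lib/commons-ego.jar" \
        "$EGO_TOP/wlp/usr/shared/resources/rest/3.7/commons-ego.jar" \
        "$EGO_TOP/integration/elk/1.4.1/elasticsearch-5.4.2/plugins/search-guard-5/commons-ego.jar" \
        "$DLI_SHARED_FS/tools/dl_plugins/common_wrapper.sh" \
        "$DLI_SHARED_FS/tools/dl_plugins/dlioptgen.py" \
        "$DLI_SHARED_FS/tools/dl_plugins/tensorflow_wrapper.sh" \
        "$DLI_SHARED_FS/fabric/1.2.2/libs/fabric.zip"
}

# Hypothetical helper: report each listed file that does not exist.
# Returns 0 only when every file is present.
check_fix_files() {
    if [ -z "${EGO_TOP:-}" ] || [ -z "${DLI_SHARED_FS:-}" ]; then
        echo "EGO_TOP and DLI_SHARED_FS must be set" >&2
        return 2
    fi
    missing=0
    # A here-document keeps the loop in the current shell, so the
    # counter is still available after the loop ends.
    while IFS= read -r f; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done <<EOF
$(fix_file_list)
EOF
    [ "$missing" -eq 0 ]
}
```

Running `check_fix_files` prints one `missing:` line per absent file and returns a non-zero status if any file is absent.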
6. Product notifications
To receive information about product solution and patch updates automatically, subscribe to product notifications on the My Notifications page (http://www.ibm.com/support/mynotifications/) on the IBM Support website (http://support.ibm.com). You can edit your subscription settings to choose the types of information you want to be notified about, for example, security bulletins, fixes, troubleshooting, and product enhancements or documentation changes.
7. Copyright and trademark information
© Copyright IBM Corporation 2019
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml