IBM Spectrum Conductor Deep Learning Impact 1.1.0 Fix 494665 Readme
Readme file for: IBM Spectrum Conductor Deep Learning Impact
Product/Component Release: 1.1.0
Fix ID: dli-1.1.0.0-build494665
Publication date: July 4, 2018
Last modified date: July 4, 2018
Description:
When you run distributed training with IBM Fabric, one or more of the following problems can occur: an “executor core dump” error when TensorFlow training uses many GPUs, a TensorFlow training job that hangs, or a Caffe training job that hangs. Apply this fix to resolve these problems.
IMPORTANT: This fix is only applicable to IBM PowerAI Enterprise 1.1 installations.
Contents:
1. Download location
2. System requirements
3. Installation prerequisites
4. Installation and configuration
5. Copyright and trademark information
1. Download location
Obtain the iFix from IBM Fix Central at http://www.ibm.com/support/fixcentral/.
2. System requirements
For system requirements, see IBM PowerAI Enterprise 1.1 at http://www.ibm.com/support/knowledgecenter/SSFHA8_1.1.0.
3. Installation prerequisites
IBM PowerAI Enterprise 1.1 is successfully installed.
4. Installation and configuration
Apply the iFix:
1. Make sure that no applications are running and that the IBM Spectrum Conductor with Spark cluster is shut down. To shut down the cluster, log in to the master host as root and take the following steps. Note that /opt/ibm/spectrumcomputing is used as the top installation directory.
a. Source the environment.
# . /opt/ibm/spectrumcomputing/profile.platform
b. Log in as a cluster administrator.
# egosh user logon -u Admin -x Admin
c. Stop all cluster services.
# egosh service stop all
d. Ensure that all services are in the DEFINED state.
# egosh service list
e. Shut down the cluster.
# egosh ego shutdown all
2. Log in to the master host as root.
3. Download the iFix package (dli-1.1.0.0-build494665.tar.gz) from IBM Fix Central to a local directory on the master host, for example: /iFix_494665
4. Change directory to DLI_SHARED_FS and back up the fabric directory. For example, assuming DLI_SHARED_FS is /dli_shared:
# cd /dli_shared
# mv fabric fabric.bak
5. Untar the package. Here, egoadmin is your IBM Spectrum Conductor with Spark cluster administrator operating system user.
# mkdir -p /dli_shared/fabric && tar zxf /iFix_494665/dli-1.1.0.0-build494665.tar.gz -C /dli_shared/fabric
# chown -R egoadmin:egoadmin /dli_shared/fabric
6. Start the cluster.
# egosh ego start
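The checks in steps 1d, 4, and 5 can be sketched as two shell helper functions. This is an illustrative sketch only and is not part of the iFix package. The function names all_services_defined and replace_fabric_dir are hypothetical, and the parsing assumes that the output of egosh service list has one header line followed by rows with the service state in the second column. Verify both assumptions against your cluster before relying on the sketch.

```shell
# Sketch only: hypothetical helpers mirroring steps 1d, 4, and 5.
# They are not shipped with the iFix.

# Reads `egosh service list` output on stdin and succeeds only if every
# service row reports DEFINED.
# ASSUMPTION: one header line, then rows with the state in column 2.
all_services_defined() {
    awk 'NR > 1 && NF > 0 && $2 != "DEFINED" { bad = 1 } END { exit bad + 0 }'
}

# Backs up <shared_fs>/fabric to fabric.bak and unpacks the package into a
# new fabric directory.
# Arguments: DLI_SHARED_FS directory, package path, optional owner (user:group).
replace_fabric_dir() {
    local shared_fs="$1"
    local tarball="$2"
    local owner="${3:-}"

    # Validate everything before moving the existing directory.
    [ -f "$tarball" ] || { echo "missing package: $tarball" >&2; return 1; }
    [ -d "$shared_fs/fabric" ] || { echo "missing directory: $shared_fs/fabric" >&2; return 1; }
    [ ! -e "$shared_fs/fabric.bak" ] || { echo "backup already exists: $shared_fs/fabric.bak" >&2; return 1; }
    tar tzf "$tarball" > /dev/null || { echo "unreadable package: $tarball" >&2; return 1; }

    # Same commands as steps 4 and 5.
    mv "$shared_fs/fabric" "$shared_fs/fabric.bak" || return 1
    mkdir -p "$shared_fs/fabric" || return 1
    tar zxf "$tarball" -C "$shared_fs/fabric" || return 1

    # Ownership change is skipped when no owner is given.
    if [ -n "$owner" ]; then
        chown -R "$owner" "$shared_fs/fabric" || return 1
    fi
}

# Example usage with the paths from this readme (run as root):
#   egosh service list | all_services_defined && egosh ego shutdown all
#   replace_fabric_dir /dli_shared /iFix_494665/dli-1.1.0.0-build494665.tar.gz egoadmin:egoadmin
```

Checking the package and the existing directories before the mv command means that a missing or corrupt download does not leave DLI_SHARED_FS without a fabric directory. If the extraction fails after the move, the original content remains available in fabric.bak.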
5. Copyright and trademark information
© Copyright IBM Corporation 1992, 2018.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.