Readme file for: IBM Spectrum Conductor Deep Learning Impact
Product/Component Release: 1.1.0
Fix ID: dli-1.1.0.0-build493651-rbc
Publication date: June 12, 2018
IBM Spectrum Conductor Deep Learning Impact 1.1.0 supports additional frameworks through the command line interface (CLI). Use the dlicmd command to run deep learning jobs with any framework on your existing cluster resources. Apply this fix to add this additional framework support to IBM Spectrum Conductor Deep Learning Impact 1.1.0.
Deep learning tasks have access to cluster resources and are controlled from the CLI. Users can submit deep learning tasks to a particular deep learning framework, provided that the cluster administrator has installed the framework and made it available. The dlicmd command uses the framework plugins found in the DLPD_HOME/conf/dl_plugins directory; deploying this iFix adds framework plugins to that directory.
A cluster administrator can create additional plugins and add them to the DLPD_HOME/conf/dl_plugins directory; see the Adding a plugin section below for more information. For dlicmd usage, run "python dlicmd.py --help".
The dlicmd command assumes that models can access data sources from within the IBM Spectrum Conductor Deep Learning Impact cluster. Model data must either be dynamically downloaded, reside on shared directories, or be available from remote data connection services.
IMPORTANT: This iFix applies only to an existing IBM Spectrum Conductor Deep Learning Impact 1.1.0 installation on POWER9 with iFix 489874 already deployed.
1. System requirements
2. Installation
3. Adding a plugin
4. More information
5. Copyright and trademark information
1. System requirements
Ensure that your system meets the hardware and software requirements for IBM Spectrum Conductor Deep Learning Impact 1.1.0; see https://www.ibm.com/support/knowledgecenter/SSWQ2D_1.1.0/in/installation-requirements.html.
2. Installation
Before you install
· Before applying the iFix, make sure that there are no running applications.
· The /opt/ibm/spectrumcomputing directory refers to the top installation directory (EGO_TOP).
· The IBM Spectrum Conductor Deep Learning Impact shared directory is defined by $DLI_SHARED_FS. Make sure to export the DLI_SHARED_FS environment variable; an example export command follows this list.
· You only need to apply this fix on the IBM Spectrum Conductor with Spark management nodes.
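For example, to export the variable (the path shown is illustrative; use your cluster's actual shared directory):
# export DLI_SHARED_FS=/gpfs/dli_shared_fs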
Installation
To apply this fix to IBM Spectrum Conductor Deep Learning Impact, do the following:
1. Log in to the master host as root.
2. Download the iFix (dli-1.1.0.0-build493651.tar.gz) to a local directory on the master host, for example: /iFix_493651
3. Switch to the cluster administrator OS account, for example:
# su egoadmin
4. Change directory to EGO_TOP/dli/dlpd. The rest of this document assumes this is the working directory.
# cd EGO_TOP/dli/dlpd
5. Source the cluster:
# . /opt/ibm/spectrumcomputing/profile.platform
6. Stop the dlpd service by running:
# egosh user logon -u Admin -x Admin
# egosh service stop dlpd
7. Back up core files. If HA is enabled, run these commands on each management host.
# cp lib/cws_dl-core-1.1.0.jar lib/cws_dl-core-1.1.0.jar.493651ifix.bak
# cp lib/cws_dl-common-1.1.0.jar lib/cws_dl-common-1.1.0.jar.493651ifix.bak
8. Back up the launcher.py file.
# cp $DLI_SHARED_FS/tools/spark_tf_launcher/launcher.py $DLI_SHARED_FS/tools/spark_tf_launcher/launcher.py.bak
9. Copy the fix tar file to the current directory and extract. If HA is enabled, run this command on each management host.
# tar xvf dli-1.1.0.0-build493651.tar.gz
10. Copy the plugin directory to your shared location.
# cp -r tools/dl_plugins $DLI_SHARED_FS/tools
# chmod 755 $DLI_SHARED_FS/tools/dl_plugins/*.sh
# cp tools/spark_tf_launcher/launcher.py $DLI_SHARED_FS/tools/spark_tf_launcher
11. If HA is enabled, run the following command on the master host.
# cp -r conf/dl_plugins $EGO_CONFDIR/../../dli/dlpd/conf
12. Start the dlpd service by running:
# egosh service start dlpd
13. Switch back to root and copy the remaining file. Make sure to do this on each management node.
# cd EGO_TOP/dli/dlpd
# cp ../../wlp/usr/servers/dlrest/apps/dlrest/META-INF/swagger.yaml ../../wlp/usr/servers/dlrest/apps/dlrest/META-INF/swagger.yaml.iFix_493651.bak
# mv swagger.yaml ../../wlp/usr/servers/dlrest/apps/dlrest/META-INF/
14. Check that dlpd started successfully by ensuring there are no errors in the dlpd.log file.
# tail -f logs/dlpd.log
15. Start using dlicmd:
# . $DLI_SHARED_FS/conf/spark-env.sh
# python bin/dlicmd.py
3. Adding a plugin
Plugins are stored in the DLPD_HOME/conf/dl_plugins directory, which also contains a common.conf file with common plugin configurations. In addition, each plugin has the following files:
Do the following to add a new plugin:
NOTE: If the framework is already installed to a different location, make sure to copy the related framework directories to the dli conda directory. For example, if your PyTorch framework is installed under conda home:
${Conda_Env_Home}/lib/python2.7/site-packages/torch
${Conda_Env_Home}/lib/python2.7/site-packages/torch-0.5.0a0+77dea37-py2.7.egg-info
Then copy these files and directories to your dli conda environment, like so:
/opt/anaconda2/envs/dli/lib/python2.7/site-packages/torch
/opt/anaconda2/envs/dli/lib/python2.7/site-packages/torch-0.5.0a0+77dea37-py2.7.egg-info
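For example, based on the paths above, the copy might look like the following (adjust the paths to match your environment):
# cp -r ${Conda_Env_Home}/lib/python2.7/site-packages/torch /opt/anaconda2/envs/dli/lib/python2.7/site-packages/
# cp -r ${Conda_Env_Home}/lib/python2.7/site-packages/torch-0.5.0a0+77dea37-py2.7.egg-info /opt/anaconda2/envs/dli/lib/python2.7/site-packages/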
Note: When creating a configuration file, it is recommended that you copy an existing .conf file and rename it appropriately. For example, if the deep learning framework is DLframework, name your file DLframework.conf. If the framework is Python-based, such as TensorFlow or Keras, you might want to start from a Python-based plugin.
Make sure to specify the following fields in the .conf file:
name:
Name of the plugin.
The name of the plugin must be the same as the name of the configuration file. For example, if your configuration file is named DLframework.conf, the name must be set to DLframework.
desc:
Detailed description of the plugin.
Include all plugin details, including how to use the plugin and how to execute deep learning tasks. The description is shown when using dlicmd to list the frameworks.
deployMode:
Deployment mode used by the plugin. Must be set to cluster.
appName:
Application name of the plugin.
The application name is visible in the cluster management console once a task is executed.
numWorkers:
Number of workers used for distributed training.
The default number of workers used for distributed training, such as Distributed Deep Learning (DDL) or distributed TensorFlow. For single node training, set this value to 1.
maxWorkers:
Maximum number of workers used for distributed training.
maxGPUPerWorker:
Maximum number of GPUs per worker.
egoSlotRequiredTimeout:
Number of seconds that IBM Spectrum Conductor waits for the required resources after an execution is started.
workerMemory:
Size of worker memory (in GB).
frameworkCmdGenerator:
Name of the Python file that generates the submit command to IBM Spectrum Conductor with Spark. Refer to the next step for how to create this generator file.
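As an illustration only, a DLframework.conf might bring these fields together as follows. The values and the exact file syntax shown here are examples, not definitive settings; copy an existing .conf file for the correct syntax, as recommended above.
name: DLframework
desc: Runs DLframework training jobs submitted through dlicmd
deployMode: cluster
appName: DLframework-dlicmd
numWorkers: 1
maxWorkers: 4
maxGPUPerWorker: 4
egoSlotRequiredTimeout: 120
workerMemory: 8
frameworkCmdGenerator: DLframeworkCmdGen.py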
For each plugin, there is a corresponding command generation file under DLI_SHARED_FS/tools/dl_plugins. The naming convention is frameworkCmdGen.py, for example: kerasCmdGen.py. The purpose of the command generation program is to generate a command for the launcher program (launcher.py) to use. The launcher program accepts non-Spark generator files.
If you follow the sample files, you will see the following pattern:
launcher.py
The launcher program that is stored in the DLI_SHARED_FS/tools/spark_tf_launcher directory.
DLI_SHARED_FS is one of the environment variables available when this file is run.
--sparkAppName
--redis_host
--redis_port
Parameters that must be passed, including Spark application name, Redis host and Redis port.
Make sure to construct the command to pass these parameters to the launcher program (launcher.py).
--devices
Devices flag.
If dlicmd is run with --gpuPerWorker specified, the gpuPerWorker value is available through the GPU_PER_WORKER environment variable. Make sure to pass this value to the launcher program (launcher.py) using the --devices flag.
--work_dir
Specifies the working directory when launcher.py runs.
When dlicmd starts a deep learning task, it creates a working directory under DLI_SHARED_FS/batchworkdir and copies model related files to that location. This directory is the working directory.
--app_type
Specifies application type. Must always be set to executable.
The exception is the distributed TensorFlow plugin, where this value is set to tf_model.
--model
Points to a wrapper script in the DLI_SHARED_FS/tools/dl_plugins directory, for example, DLI_SHARED_FS/tools/dl_plugins/keras_wrapper.sh.
The launcher program (launcher.py) eventually runs this script, which prepares everything that is necessary before the deep learning framework runs.
Remaining arguments:
The remaining arguments are those that the user passes when calling dlicmd.
The launcher program (launcher.py) passes these arguments to wrapper scripts such as keras_wrapper.sh.
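Putting these parameters together, the launcher portion of the generated command might look like the following. This is a sketch only: the values, the working directory name, and the wrapper script name DLframework_wrapper.sh are illustrative, and the sample generator files under DLI_SHARED_FS/tools/dl_plugins show how the complete submission command is built.
$DLI_SHARED_FS/tools/spark_tf_launcher/launcher.py \
    --sparkAppName DLframework-dlicmd \
    --redis_host <redis host> \
    --redis_port <redis port> \
    --devices $GPU_PER_WORKER \
    --work_dir $DLI_SHARED_FS/batchworkdir/<execution directory> \
    --app_type executable \
    --model $DLI_SHARED_FS/tools/dl_plugins/DLframework_wrapper.sh \
    <remaining dlicmd arguments>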
When creating a wrapper script, here are a few tips to remember:
4. More information
To obtain more information about IBM Spectrum Conductor Deep Learning Impact, see IBM Knowledge Center at www.ibm.com/support/knowledgecenter/en/SSWQ2D_1.1.0.
For any questions regarding this solution, ask us directly on our Slack channel. For instructions on how to sign up for our Slack channel, see www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Wa0adcb6b782c_4e8b_b3c8_d633cb9456d8/page/Slack%20channel%20(IBM%20Cloud%20Technology)%20sign%20up%20page
5. Copyright and trademark information
© Copyright IBM Corporation 1992, 2018.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.