Readme File for adding additional framework support to IBM Spectrum Conductor Deep Learning Impact 1.1.0

Readme file for: IBM Spectrum Conductor Deep Learning Impact

Product/Component Release: 1.1.0

Fix ID: dli-1.1.0.0-build493651-rbc

Publication date: June 12, 2018

 

With this fix, IBM Spectrum Conductor Deep Learning Impact 1.1.0 supports additional deep learning frameworks through the command line interface (CLI). Use the dlicmd command to run deep learning jobs with any framework on your existing cluster resources. Apply this fix to add additional framework support to IBM Spectrum Conductor Deep Learning Impact 1.1.0.

Deep learning tasks have access to cluster resources and are controlled from the CLI. Users can submit deep learning tasks to a particular deep learning framework, provided that it has been installed and made available by the cluster administrator. The dlicmd command uses the framework plugins that are found in the DLPD_HOME/conf/dl_plugins directory. Deploying this iFix adds a set of framework plugins to that directory.

 

Plugins can be created and added to the DLPD_HOME/conf/dl_plugins directory by a cluster administrator; see the Adding a plugin section below for more information. For dlicmd usage, run "python dlicmd.py --help".

The dlicmd command assumes that models can access data sources from within the IBM Spectrum Conductor Deep Learning Impact cluster. Model data must either be dynamically downloaded, reside on shared directories, or be available from remote data connection services.

IMPORTANT: This iFix is only applicable to an existing IBM Spectrum Conductor Deep Learning Impact 1.1.0 installation on POWER9 that already has iFix 489874 deployed.

 

1.      System requirements

2.      Installation

3.      Adding a plugin

4.      More information

5.      Copyright and trademark information

 

1. System requirements

 

Ensure that your system meets the hardware and software requirements for IBM Spectrum Conductor Deep Learning Impact 1.1.0; see https://www.ibm.com/support/knowledgecenter/SSWQ2D_1.1.0/in/installation-requirements.html.

 

2. Installation

 

Before you install

 

·       Before applying the iFix, make sure that there are no running applications.

·        The /opt/ibm/spectrumcomputing directory refers to the top installation directory (EGO_TOP in the steps below).

·        The IBM Spectrum Conductor Deep Learning Impact shared directory is defined by $DLI_SHARED_FS. Make sure to export the DLI_SHARED_FS environment variable; see the example command after this list.

·       You only need to apply this fix on the IBM Spectrum Conductor with Spark management nodes.
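
For example, a minimal illustration of exporting the shared directory variable (the path shown is a placeholder; use your own shared directory):

# export DLI_SHARED_FS=/gpfs/dli_shared_fs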

 

 

Installation

 

To apply this fix to IBM Spectrum Conductor Deep Learning Impact, do the following:

 

1.      Log in to the master host as root.

2.      Download the iFix (dli-1.1.0.0-build493651.tar.gz) to a local directory on the master host, for example: /iFix_493651

3.      Switch to the cluster administrator OS account, for example:

# su egoadmin

 

4.      Change directory to EGO_TOP/dli/dlpd. The rest of this document assumes this is the working directory.

# cd EGO_TOP/dli/dlpd

 

5.      Source the cluster:

# . /opt/ibm/spectrumcomputing/profile.platform

 

6.      Stop the dlpd service by running:

# egosh user logon -u Admin -x Admin

# egosh service stop dlpd

 

7.      Back up core files. If HA is enabled, run these commands on each management host.

# cp lib/cws_dl-core-1.1.0.jar lib/cws_dl-core-1.1.0.jar.493651ifix.bak

# cp lib/cws_dl-common-1.1.0.jar lib/cws_dl-common-1.1.0.jar.493651ifix.bak

 

8.      Back up the launcher.py file.

# cp $DLI_SHARED_FS/tools/spark_tf_launcher/launcher.py $DLI_SHARED_FS/tools/spark_tf_launcher/launcher.py.bak

 

9.      Copy the fix tar file to the current directory and extract it. If HA is enabled, run this command on each management host.

# tar xvf dli-1.1.0.0-build493651.tar.gz

 

10.      Copy the plugin directory to your shared location.

# cp -r tools/dl_plugins $DLI_SHARED_FS/tools

# chmod 755 $DLI_SHARED_FS/tools/dl_plugins/*.sh

# cp tools/spark_tf_launcher/launcher.py $DLI_SHARED_FS/tools/spark_tf_launcher

 

11.      If HA is enabled, run the following command on the master host.

# cp -r conf/dl_plugins $EGO_CONFDIR/../../dli/dlpd/conf

 

12.      Start the dlpd service by running:

# egosh service start dlpd 

 

13.      Switch back to root and copy the remaining file. Make sure to do this on each management node.

# cd EGO_TOP/dli/dlpd

# cp ../../wlp/usr/servers/dlrest/apps/dlrest/META-INF/swagger.yaml ../../wlp/usr/servers/dlrest/apps/dlrest/META-INF/swagger.yaml.iFix_493651.bak

# mv swagger.yaml ../../wlp/usr/servers/dlrest/apps/dlrest/META-INF/

 

14.   Check that dlpd started successfully by ensuring there are no errors in the dlpd.log file.

# tail -f logs/dlpd.log

 

15.  Start using dlicmd:

# . $DLI_SHARED_FS/conf/spark-env.sh

# python bin/dlicmd.py

3. Adding a plugin

Plugins are stored in the DLPD_HOME/conf/dl_plugins directory. This directory contains a common.conf file with common plugin configurations. In addition, each plugin has its own .conf configuration file in this directory and a corresponding command generator file (and, for some frameworks, a wrapper script) in the DLI_SHARED_FS/tools/dl_plugins directory.

Do the following to add a new plugin:

  1. Install the framework into the existing dli conda environment and test the deep learning framework. Make sure that the deep learning framework was installed successfully and can be run from all nodes in the cluster.

NOTE: If the framework is already installed to a different location, make sure to copy the related framework directories to the dli conda directory. For example, if your PyTorch framework is installed under conda home:

${Conda_Env_Home}/lib/python2.7/site-packages/torch
${Conda_Env_Home}/lib/python2.7/site-packages/torch-0.5.0a0+77dea37-py2.7.egg-info

Then copy these files and directories to your dli conda environment, like so:

/opt/anaconda2/envs/dli/lib/python2.7/site-packages/torch
/opt/anaconda2/envs/dli/lib/python2.7/site-packages/torch-0.5.0a0+77dea37-py2.7.egg-info
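
For example, a copy command along these lines could be used (illustrative only; ${Conda_Env_Home} and the dli environment path must match your own installation):

# cp -r ${Conda_Env_Home}/lib/python2.7/site-packages/torch ${Conda_Env_Home}/lib/python2.7/site-packages/torch-0.5.0a0+77dea37-py2.7.egg-info /opt/anaconda2/envs/dli/lib/python2.7/site-packages/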

 

  2. Create a corresponding .conf configuration file for the deep learning framework in the DLPD_HOME/conf/dl_plugins directory.

Note: When creating a configuration file, it is recommended that you copy an existing .conf file and name it appropriately. For example, if the deep learning framework is DLframework, then name your file DLframework.conf. If the framework is Python-based, such as TensorFlow or Keras, you might want to start from a Python-based plugin.

Make sure to specify the following fields in the .conf file (an illustrative sample follows the field descriptions below):

name:

Name of the plugin.

The name of the plugin must be the same as the name of the configuration file. For example, if your configuration file is named DLframework.conf, the name must be set to DLframework.

desc:

Detailed description of the plugin.

Include all plugin details, including how to use the plugin and how to execute deep learning tasks. The description is shown when using dlicmd to list the frameworks.

deployMode:

Deployment mode used by the plugin. Must be set to cluster.

appName:

Application name of the plugin.

The application name is visible in the cluster management console once a task is executed.

numWorkers:

Number of workers used for distributed training.

The default number of workers used for distributed training, including Distributed Deep Learning (DDL) or distributed TensorFlow. For single node training, set this value to 1.

maxWorkers:

Maximum number of workers used for distributed training.

maxGPUPerWorker:

Maximum number of GPUs per worker.

egoSlotRequiredTimeout:

Number of seconds that IBM Spectrum Conductor waits for the required resources after an execution is started.

workerMemory:

Size of worker memory (in GB).

frameworkCmdGenerator:

Name of the Python file that generates the submit command to IBM Spectrum Conductor with Spark. Refer to the next step for how to create this generator file.
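
As a point of reference, here is an illustrative sketch of such a configuration, shown as field: value pairs that mirror the list above. The exact file syntax is not described in this readme, so copy an existing .conf file from DLPD_HOME/conf/dl_plugins to get the correct format; the DLframework name, worker counts, memory size, and generator file name below are example values only.

name: DLframework
desc: Runs DLframework training jobs submitted through dlicmd
deployMode: cluster
appName: DLframework-app
numWorkers: 1
maxWorkers: 4
maxGPUPerWorker: 4
egoSlotRequiredTimeout: 600
workerMemory: 8
frameworkCmdGenerator: DLframeworkCmdGen.py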

  3. Create the command generator file for the deep learning framework that you specified in the frameworkCmdGenerator field of the .conf configuration file.

For each plugin, there is a corresponding command generation file under DLI_SHARED_FS/tools/dl_plugins. The naming convention is frameworkCmdGen.py, for example: kerasCmdGen.py. The purpose of the command generation program is to generate a command for the launcher program (launcher.py) to use. The launcher program accepts non-Spark programs. An illustrative Python sketch of a command generator follows the parameter descriptions below.

If you follow the sample files, you will see the following pattern:

launcher.py

The launcher program that is stored in the DLI_SHARED_FS/tools/spark_tf_launcher directory.

DLI_SHARED_FS is one of the environment variables available when this file runs.

--sparkAppName

--redis_host

--redis_port

Parameters that must be passed: the Spark application name, the Redis host, and the Redis port.

Make sure to construct the command to pass these parameters to the launcher program (launcher.py).

--devices

Devices flag.

If dlicmd is run with --gpuPerWorker, then the gpuPerWorker value is available through the GPU_PER_WORKER environment variable. Make sure to pass this value to the launcher program (launcher.py) using the --devices flag.

--work_dir

Specifies the working directory when launcher.py runs.

When dlicmd starts a deep learning task, it creates a working directory under DLI_SHARED_FS/batchworkdir and copies model related files to that location. This directory is the working directory.

--app_type

Specifies application type. Must always be set to executable.

The exception is the distributed TensorFlow plugin, where this value is set to tf_model.

--model

Points to a wrapper script in the DLI_SHARED_FS/tools/dl_plugins directory, for example, DLI_SHARED_FS/tools/dl_plugins/keras_wrapper.sh.

Eventually, the launcher program (launcher.py) runs this script, whose purpose is to prepare everything necessary before running the deep learning framework.

Remaining arguments:

The remaining arguments are those passed by the user when calling dlicmd.

The launcher program (launcher.py) passes these arguments to wrapper scripts such as keras_wrapper.sh.
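
To make the pattern concrete, here is a minimal Python sketch of how such a command generator might assemble the launcher.py argument list. The file name myFrameworkCmdGen.py, the function name, and the wrapper script name are hypothetical, and the exact entry point that dlpd expects from a generator file is not described in this readme; use the shipped *CmdGen.py files as the authoritative reference.

# myFrameworkCmdGen.py -- illustrative sketch only; names are hypothetical.
import os

def build_launcher_command(work_dir, spark_app_name, redis_host, redis_port, user_args):
    """Assemble the launcher.py command line from the parameters described above."""
    shared_fs = os.environ["DLI_SHARED_FS"]  # available when the generator runs
    launcher = os.path.join(shared_fs, "tools/spark_tf_launcher/launcher.py")
    # Hypothetical wrapper script; shipped plugins use scripts such as keras_wrapper.sh here.
    wrapper = os.path.join(shared_fs, "tools/dl_plugins/myframework_wrapper.sh")

    cmd = [
        "python", launcher,
        "--sparkAppName", spark_app_name,
        "--redis_host", redis_host,
        "--redis_port", str(redis_port),
        "--work_dir", work_dir,
        "--app_type", "executable",  # tf_model is used only by the distributed TensorFlow plugin
        "--model", wrapper,
    ]

    # If dlicmd was called with --gpuPerWorker, the value is exposed in GPU_PER_WORKER
    # and must be forwarded to launcher.py through the --devices flag.
    gpu_per_worker = os.environ.get("GPU_PER_WORKER")
    if gpu_per_worker:
        cmd += ["--devices", gpu_per_worker]

    # The remaining arguments from the dlicmd call are forwarded to the wrapper script.
    cmd += list(user_args)
    return cmd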

  4. Depending on the framework, some plugins require a wrapper script. For example, a wrapper script is not required for the TensorFlow or PyTorch plugin because they use Python and the existing launcher.py script. For other frameworks, some additional preparation is needed. For example, for DDL, the wrapper script named ddlTensorFlow_wrapper.sh was created. The ddlTensorFlow_wrapper.sh script prepares any requirements related to the mpirun command before calling the mpirun command. An illustrative wrapper sketch follows.
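
The sketch below is illustrative only: the script name, the conda activation path, and the way the model program is started are assumptions, not part of the iFix.

#!/bin/bash
# myframework_wrapper.sh -- hypothetical example, not shipped with this iFix.
# launcher.py runs this script with the remaining dlicmd arguments appended.

# Make the framework available, for example by activating the dli conda
# environment (adjust the path to your own Anaconda installation).
source /opt/anaconda2/bin/activate dli

# Framework-specific preparation goes here (the shipped ddlTensorFlow_wrapper.sh,
# for instance, prepares its mpirun-related requirements at this point).

# Finally start the user's program; here we assume the first forwarded
# argument is the model script to run.
exec python "$@"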

When creating a wrapper script, use the shipped wrapper scripts in the DLI_SHARED_FS/tools/dl_plugins directory (such as keras_wrapper.sh and ddlTensorFlow_wrapper.sh) as a reference.

  5. Test the new plugin using dlicmd. If any issues occur while building the plugin, use dlicmd with the --debug-level debug option, or check the log files, including dlpd.log in the DLPD_HOME/logs directory and the application driver or executor logs.

 

4. More information

To obtain more information about IBM Spectrum Conductor Deep Learning Impact, see IBM Knowledge Center at www.ibm.com/support/knowledgecenter/en/SSWQ2D_1.1.0.

For any questions regarding this solution, ask us directly on our Slack channel. For instructions on how to sign up for our Slack channel, see www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Wa0adcb6b782c_4e8b_b3c8_d633cb9456d8/page/Slack%20channel%20(IBM%20Cloud%20Technology)%20sign%20up%20page  

 

5. Copyright and trademark information

© Copyright IBM Corporation 1992, 2018.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.